Method for the Interoperation of Virtual Organizations

ABSTRACT

A cooperative data stream processing system is provided that utilizes a plurality of independent, autonomous and possibly heterogeneous sites in a cooperative arrangement to process user-defined job requests over dynamic, continuous streams of data. A method is provided to organize the distributed sites into a plurality of virtual organizations that can be further combined and virtualized into virtualized virtual organizations. These virtualized virtual organizations can also include additional distributed sites and existing virtualized virtual organizations, and all members of a given virtualized virtual organization can share data and processing resources in order to process jobs on either a task-based or goal-based allocation mechanism. The virtualized virtual organization is created dynamically using ad-hoc collaborations among the members and is arranged in either a federated or cooperative architecture. Collaborations between members are either tightly coupled or loosely coupled. Flexible management of resources is provided, with resources being provided under exclusive control or based on best-effort access.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation-in-part of co-pending and co-owned U.S. patent application Ser. No. 11/733,684 filed Apr. 10, 2007, a continuation-in-part of co-pending and co-owned U.S. patent application Ser. No. 11/733,732 filed Apr. 10, 2007, and a continuation-in-part of co-pending and co-owned U.S. patent application Ser. No. 11/733,724 filed Apr. 10, 2007. The entire disclosures of all three patent applications are incorporated herein by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

The invention disclosed herein was made with U.S. Government support under Contract No. H98230-05-3-0001 awarded by the U.S. Department of Defense. The Government has certain rights in this invention.

FIELD OF THE INVENTION

The present invention relates to data analysis in continuous data streams.

BACKGROUND OF THE INVENTION

Systems for processing streams of data utilize continuous streams of data as inputs, process these data in accordance with prescribed processes and produce ongoing results. Commonly used data processing stream structures perform traditional database operations on the input streams. Examples of these commonly used applications are described in Daniel J. Abadi et al., The Design of the Borealis Stream Processing Engine, CIDR 2005, Second Biennial Conference on Innovative Data Systems Research (2005), Sirish Chandrasekaran et al., Continuous Dataflow Processing for an Uncertain World, Conference on Innovative Data Systems Research (2003) and The STREAM Group, STREAM: The Stanford Stream Data Manager, IEEE Data Engineering Bulletin, 26(1), (2003). In general, systems utilize traditional database structures and operations, because structures and operations for customized applications are substantially more complicated than the relational database paradigm. The reasons for this comparison are illustrated, for example, in Michael Stonebraker, Uğur Çetintemel, and Stanley B. Zdonik, The 8 Requirements of Real-Time Stream Processing, SIGMOD Record, 34(4):42-47, (2005).

These systems typically operate independently and work only with the processing resources contained within a single system to analyze streams of data that are either produced by or directly accessible by the single site. Although multiple sites can be used, these sites operate independently and do not share resources or data.

SUMMARY OF THE INVENTION

Systems and methods in accordance with the present invention provide for negotiated cooperation among a plurality of independent sites to share data and processing resources in order to process user-defined inquiries, i.e., formal specifications of desired end results of the user, over continuous dynamic streams of data. In accordance with one exemplary embodiment, the present invention is directed to a method for cooperative data stream processing that includes identifying two or more distributed sites. Each site contains the components, either within a single node or location or distributed across the site, capable of independently processing continuous dynamic streams of data. Therefore, each site can process data independent of other sites within the system. The system can optionally contain sites that are of more limited processing capacity. The sites can be heterogeneous, homogeneous or some combination of heterogeneous and homogeneous sites. As used herein, heterogeneity or homogeneity among sites is based upon whether there are differences in execution environments of the sites, including but not limited to aspects such as available applications, data type systems and security and privacy policies.

The method facilitates the sharing among the sites of data, from primal and derived data sources, including continuous dynamic data streams, resources, including processing resources, and combinations thereof. Suitable processing resources include, but are not limited to, central processing unit resources, memory resources, storage resources, software resources, hardware resources, network bandwidth resources, execution resources and combinations thereof. In one embodiment, facilitating the sharing includes negotiating peering relationships among the sites. Each peering relationship contains a description of the data and the resources shared by one or more sites and a level of autonomy maintained by these sites. Suitable peering relationships include cooperative peering relationships and federated peering relationships. In one embodiment, facilitating the sharing among sites includes using common interest policies to define relationships between sites. Each common interest policy identifies data and resources to be shared between the sites and processing that each site is willing to perform on the data, for example on behalf of the other sites.

In one embodiment, facilitating the sharing among sites includes using a resource awareness engine or resource awareness manager in communication with each one of a plurality of data source and resource stores to obtain resources and data from a first site and to communicate these resources and data to one or more second sites. These data source and resource stores include relational and semantic databases.

Having identified the sites and facilitated the sharing of data and resources among the sites, at least one of the distributed sites having access to the shared data or resources is used to process user-defined inquiries over continuous dynamic streams of data. In order to use the sites to process user-defined inquiries, data from a plurality of remote sites can be communicated to a single home site, data can be processed at each one of a plurality of home sites before communicating the processed data to a single home site, effective ownership of data disposed at one or more remote sites can be transferred to a single home site and remote sites can be used to schedule processing of data.

In one embodiment, using the distributed sites to process user-defined inquiries includes identifying from each inquiry at least one distributed plan that is translated into a job for each user-defined inquiry such that each job utilizes data and processing resources from one or more of the sites and executing each job on one of the identified sites. In one embodiment, each job includes a plurality of interconnected processing elements and identification of one or more jobs includes identifying the processing elements associated with each job. In addition, execution of each job includes building one or more subjobs or applications containing identified processing elements from one or more jobs and executing each subjob on one of the identified sites. The method also includes managing the execution of the processing elements on the distributed sites. In one embodiment, processing demands are transferred from a first site to a second site in order to facilitate processing of the job components.

The present invention is also directed to a cooperative data stream processing system containing two or more distributed sites. Each distributed site is in communication with other sites and contains an independent instance of a data stream processing environment. The system also includes a plurality of peering relationships among the sites to facilitate cooperation among the sites for sharing data and processing resources. In one embodiment, each independent instance of the data stream processing environment includes a stream processing core to manage the distributed execution of subjobs on the site, a scheduler to control flow of data and resources between sites, a storage management system to control data to be persisted and a planner to assemble the subjobs to be executed on the site based on user-defined inquiries.

In one embodiment, each independent instance of the data stream processing environment contains a complete instance of a system architecture that facilitates receipt of user-defined inquiries, processing these user-defined inquiries on continuous data streams using the sites and communicating results of the processing. Suitable system architectures include a user experience layer to interface with users to accept the user-defined job inquiries and to deliver the processing results, an inquiry services layer in communication with the user experience layer to facilitate descriptions of the user-defined inquiries, a job planner disposed within the inquiry services layer, the job planner being capable of producing one or more jobs associated with each inquiry and capable of fulfilling the job, a job management component in communication with the job planner capable of executing the jobs using the sites and a stream processing core to manage the execution of the jobs on the sites and to deliver the processing results to the user experience layer. In one embodiment, the architecture also includes a data source management component in communication with the job planner. The data source management component is capable of matching data streams to jobs.

In accordance with one exemplary embodiment, the present invention is directed to a method for creating an interoperation of virtual organizations in a cooperative data stream processing system. A plurality of distributed sites is identified. Each site includes components capable of independently processing continuous dynamic streams of data. In addition, a plurality of virtual organizations is identified. Each virtual organization includes a combination of sites selected from the identified plurality of distributed sites and configured to share at least one of data and processing resources within the combination of sites. A first interoperation of virtual organizations containing a first group of virtual organizations selected from the identified plurality of virtual organizations is created. Each virtual organization within the first interoperation of virtual organizations is configured to share at least one of data and resources with other members of the first group of virtual organizations. In one embodiment, the first interoperation of virtual organizations also includes additional sites selected from the identified plurality of distributed sites. In one embodiment, the virtual organizations in the first group of virtual organizations are heterogeneous.

In one embodiment, the first interoperation of virtual organizations also includes at least one existing interoperation of virtual organizations capable of sharing at least one of data and resources with the first interoperation of virtual organizations. Each existing interoperation of virtual organizations includes at least one of member virtual organizations and member sites. In one embodiment, a second interoperation of virtual organizations is created containing a second group of virtual organizations such that at least one of the plurality of identified virtual organizations is a member of both the first and second interoperations of virtual organizations. In one embodiment, a lead virtual organization is identified in the first group of virtual organizations, and the identified lead virtual organization is used to manage the first interoperation of virtual organizations.

In one embodiment, common interest policies among the virtual organizations in the first group of virtual organizations are used to define relationships among the virtual organizations in the first group of virtual organizations. This use includes using common interest policies to define the interoperation of virtual organizations, to identify the first group of virtual organizations, to identify resource allocation constraints for each virtual organization within the first group of virtual organizations, to identify sharing relationships among virtual organizations within the first group of virtual organizations, to provide heterogeneity mapping between virtual organizations in the first group of virtual organizations, to provide communication details between virtual organizations in the first group of virtual organizations or combinations thereof.

In one embodiment, the first group of virtual organizations constitutes a cooperative architecture. In this embodiment, resources in the first group of virtual organizations are allocated directly from the virtual organizations containing the resources to be allocated to the virtual organizations requesting the allocated resources. In one embodiment, the first group of virtual organizations constitutes a federated architecture. In this embodiment, a lead virtual organization in the first group of virtual organizations is identified for the federated architecture. In one embodiment, the lead virtual organization is used to control allocations of the shared data and resources among all virtual organizations in the first group of virtual organizations.
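
The difference between the two architectures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `VirtualOrganization`, `allocate_cooperative` and `allocate_federated`, the integer resource units and the order in which the lead consults members are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class VirtualOrganization:
    """Hypothetical VO holding a count of shareable resource units."""
    name: str
    free_units: int

    def grant(self, units: int) -> int:
        """Give up to `units` of this VO's free resources; return the amount granted."""
        granted = min(units, self.free_units)
        self.free_units -= granted
        return granted


def allocate_cooperative(owner: VirtualOrganization, units: int) -> int:
    """Cooperative architecture: the requester obtains resources directly from the owning VO."""
    return owner.grant(units)


def allocate_federated(lead: VirtualOrganization,
                       members: list[VirtualOrganization],
                       units: int) -> dict[str, int]:
    """Federated architecture: the lead VO controls which VOs supply the request."""
    grants: dict[str, int] = {}
    remaining = units
    # Assumption: the lead offers its own resources first, then walks the members in order.
    for vo in [lead, *members]:
        if remaining == 0:
            break
        got = vo.grant(remaining)
        if got:
            grants[vo.name] = got
            remaining -= got
    return grants
```

In the cooperative case no intermediary sees the request, whereas in the federated case every allocation passes through the lead, which is what lets it enforce group-wide allocation decisions.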

In one embodiment where the first interoperation of virtual organizations includes existing interoperations of virtual organizations, each existing interoperation of virtual organizations constitutes a federated architecture. In one embodiment, lead virtual organizations are identified within each existing interoperation of virtual organizations and used to participate in the first interoperation of virtual organizations. In one embodiment, use of the identified lead virtual organizations to participate in the first interoperation of virtual organizations includes communicating requests for shared data and resources from a first interoperation of virtual organizations lead site to the identified lead virtual organizations in the existing interoperations of lead sites and using the identified lead virtual organizations to allocate the requested shared data and resources from member sites in the existing interoperations of lead sites. In one embodiment, each existing interoperation of virtual organizations constitutes a federated architecture, and member virtual organizations of the existing interoperations of virtual organizations are exposed directly to the first interoperation of virtual organizations. In one embodiment, data and resources in the member virtual organizations are allocated directly from the member virtual organizations to the first interoperation of virtual organizations.

In one embodiment, creation of the first interoperation of virtual organizations includes utilizing a plurality of ad-hoc collaborations between the virtual organizations in the first group of virtual organizations to create the first interoperation of virtual organizations dynamically.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of an embodiment of a system architecture for use on all sites within the cooperative data processing system of the present invention;

FIG. 2 is a schematic representation of an embodiment of peering relationships among sites within the cooperative system;

FIG. 3 is a schematic representation of the system architecture in combination with an embodiment of multi-site system functions;

FIG. 4 is a schematic representation of an embodiment of inquiry processing using the cooperative data processing system of the present invention;

FIG. 5 is a schematic representation of an embodiment of site arrangements to provide for inter-site system failover;

FIG. 6 is a schematic representation of an embodiment of the deployment of a distributed plan for the execution of jobs in the cooperative data stream processing system of the present invention;

FIG. 7 is a schematic representation of an embodiment of a virtualized virtual organization in accordance with the present invention; and

FIG. 8 is a schematic representation of an embodiment of collaboration between a plurality of virtualized virtual organizations.

DETAILED DESCRIPTION

Systems and methods in accordance with the present invention provide for the inter-cooperation of multiple, autonomous, distributed stream processing sites. Each individual stream processing site is capable of processing a continuous dynamic flow of information that is created internally at that site or that originates from sources external to that site. Important or relevant information is extracted from a continuous stream containing voluminous amounts of unstructured and mostly irrelevant data. Processing of data streams in accordance with the present invention is utilized in analyzing financial markets, for example predicting stock value based on processing streams of real-world events, supporting responses to natural disasters such as hurricanes and earthquakes, for example based on the movement of rescue vehicles, available supplies or recovery operations and in processing sensor data. Examples of sensor data that can be analyzed include data on volcanic activity as described in G. Werner-Allen et al., Deploying a Wireless Sensor Network on an Active Volcano, IEEE Internet Computing, 10(2):18-25 (2006) and telemetry from radio telescopes as described in T. Risch, M. Koparanova and B. Thide, High-performance GRID Database Manager for Scientific Data, Proceedings of 4th Workshop on Distributed Data & Structures (WDAS-2002), Carleton Scientific (Publ), 2002.

Exemplary embodiments of cooperative data processing systems in accordance with the present invention provide for rapid system reconfiguration. The system adjusts quickly to the changing requirements and priorities of users and administrators. As the system adjusts, it simultaneously identifies and incorporates new input streams into its processing and manages the loss of existing data sources or processing capacity.

Cooperative data stream processing systems in accordance with the present invention function well under high load. In one embodiment, the system is assumed to be in a constant state of overload and must continually adjust its resource allocations to support the highest priority activities. Applications utilizing exemplary embodiments of the system for cooperative data stream processing in accordance with the present invention contain significant resilience to variations in processing resources, missing data and available input streams, among others. The missing data include data that is replaced by more important data as described in Fred Douglis et al., Short Object Lifetimes Require a Delete-Optimized Storage System, Proceedings of 11th ACM SIGOPS European Workshop (2004).

Exemplary systems for cooperative data stream processing in accordance with the present invention are typically heterogeneous. A given system for cooperative data stream processing contains a plurality of distributed sites. In one embodiment, each site is autonomous. Certain sites include substantial processing capacity, for example, thousands of processing nodes and terabytes to petabytes of storage. Other sites within the system have limited resources. Sites with limited resources may provide specialized or specific tasks such as data acquisition. Although two or more sites can be operated by a single domain or organization, each one of the plurality of sites is preferably completely autonomous and can vary significantly in execution environment, policies and goals. The extent and type of cooperation provided by each autonomous site varies based on the structure and compatibility of any given set of sites.

Cooperative data stream processing systems in accordance with the present invention include a stream processing core to manage the distributed execution of software components of applications, a nano-scheduler to control the traffic flow between processing elements, a storage management system to control the data to be persisted in the storage system based on retention values, a planner to assemble subjobs based on user requests and available software components and a security enforcement architecture. In general, the plurality of sites that are contained within the cooperative data stream processing systems cooperate. The resultant interactions are supported and balanced against other requirements and challenges including autonomy, privacy and security constraints and differences in execution environments among the various sites.

Exemplary systems in accordance with the present invention utilize cooperation among the various sites. This cooperation takes several forms. Sites cooperate by exchanging data. Each site can pass primal data streams on to other sites that need to analyze the same input data. Primal data streams are data streams that are brought into one site from outside the system. In addition, each site can pass derived data streams on to other sites. Derived data streams are data streams that are created within a site using analysis of other streams, for example primal data streams. Sites also cooperate by sharing resources such as execution resources, software resources and hardware resources, among others, in order to handle processing overloads. Overloads result from sudden increases in the system workload or sudden decreases in available resources, for example due to partial failure of a given site. In the case of a complete failure of a given site, cooperation provides for the shifting of important processing to another site. Cooperation also provides for access to specialized resources, for example devices and services, that are unique to certain sites.

Referring initially to FIG. 1, an exemplary embodiment of an architecture 100 for the cooperative data stream processing systems of the present invention is illustrated. The architecture includes a plurality of layers. The first or highest layer is the user experience (UE) layer 110. The UE layer provides the interface between the cooperative data stream processing system and users 111 of the system. Each user interacts with the system through an interface such as a graphical user interface (GUI) on a computing system in communication with one or more of the plurality of sites within the system. Through this interface, each user presents inquiries 115 to the system that the system processes through one or more primal or derived data streams using the cooperating sites within the system. In one embodiment, these inquiries are converted to high-level queries. An example of a high-level query is to provide a listing containing the locations of all bottled water reserves within a hurricane relief area. The UE layer 110 is also used by the cooperative data stream processing system to deliver the query results to the requesting user.

In communication with the UE layer is the inquiry services (INQ) layer 112. The INQ layer facilitates the description of a user's job request and the desired final results in a pre-determined high level language. These high level languages are used to depict the semantic meaning of the final results and to specify user preferences such as which data sources to include in or to exclude from the plan. The INQ layer includes a job planner 113 subcomponent that determines or identifies, based on the user-defined inquiries as expressed in the appropriate high level language, appropriate primal or derived data sources and processing elements (PEs) that can achieve the desired goals of the inquiry. A job contains a composition of data sources and processing elements interconnected in a flow graph. The job planner subcomponent submits the produced jobs to the job management component 116 for execution. The job planner subcomponent, in defining the jobs, takes into account various constraints, for example, available input data sources, the priority of the user-defined inquiry, processing available to this inquiry relative to everything else being produced by the system and privacy and security constraints, among other factors. Examples of suitable planner components are described in Anton Riabov and Zhen Liu, Planning for Stream Processing Systems, Proceedings of AAAI-2005, July 2005 and Anton Riabov and Zhen Liu, Scalable Planning for Distributed Stream Processing Systems, Proceedings of ICAPS 2006, June 2006.
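
A job as a flow graph of data sources and PEs can be pictured with a small sketch. It assumes a job is represented as a mapping from each node to the nodes it consumes from; the node names are hypothetical and loosely follow the bottled water example, and the standard-library topological sort stands in for whatever ordering a real job manager would use.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each key is a data source or PE, and its value is the
# set of upstream nodes whose output it consumes.
job = {
    "relief_area_feed": set(),                     # primal data source
    "inventory_feed": set(),                       # primal data source
    "filter_bottled_water": {"inventory_feed"},    # PE
    "join_by_location": {"relief_area_feed", "filter_bottled_water"},  # PE
    "report_locations": {"join_by_location"},      # PE producing the final result
}


def execution_order(flow_graph):
    """Return one valid order in which the sources and PEs can be started."""
    return list(TopologicalSorter(flow_graph).static_order())


order = execution_order(job)
```

Every node appears after all of the nodes it depends on, so `report_locations` is always last in this example.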

In one embodiment, the cooperative data stream processing system includes a data source management (DSM) component 114 in communication with the INQ layer and the job planner. Since there are many possible data streams that a job can process, including both primal streams from outside the system and derived streams created by sites within the system, the DSM component matches jobs, i.e., from user-defined inquiries, with appropriate data streams. In order to match jobs with data streams, the DSM component utilizes constraints specified in the user-defined inquiries. These constraints include, but are not limited to, data type constraints and source quality constraints. The DSM component returns data source records that provide information to access these data sources. In one embodiment, the INQ layer and job planner use the DSM component to formulate job execution plans, which are then submitted to lower levels of the system.
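
The matching performed by the DSM component can be illustrated by filtering candidate data source records against the two constraint kinds named above. This is only a sketch: the `DataSourceRecord` fields, the numeric quality score between 0 and 1 and the function name `match_sources` are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSourceRecord:
    """Hypothetical record describing how to reach one data stream."""
    name: str
    data_type: str
    quality: float   # assumed score from 0.0 (worst) to 1.0 (best)
    primal: bool     # True for primal streams, False for derived streams


def match_sources(records, data_type, min_quality):
    """Return the records that satisfy a data type and a source quality constraint."""
    return [
        r for r in records
        if r.data_type == data_type and r.quality >= min_quality
    ]


catalog = [
    DataSourceRecord("sensor_raw", "telemetry", 0.6, primal=True),
    DataSourceRecord("sensor_cleaned", "telemetry", 0.9, primal=False),
    DataSourceRecord("news_wire", "text", 0.8, primal=True),
]
matches = match_sources(catalog, data_type="telemetry", min_quality=0.7)
```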

In response to user-defined inquiries and in combination with the data source records provided from the DSM component, the job planner formulates distributed plans that are translated into one or more jobs 117 to be executed within the system and delivers these jobs to the job management layer 116 of the system. Each job identified by the job planner subcomponent contains a plurality of interconnected PEs 119. In one embodiment, incoming data stream objects are processed by the system to produce outgoing data stream objects that are routed to the appropriate PE or to storage. The PEs can be either stateless transformers or much more complicated stateful applications. The cooperative data stream processing system through the job management layer identifies the PEs in the submitted jobs and builds one or more subjobs 123 from the PEs of different jobs by linking these PEs, possibly reusing them among different subjobs, to enable sophisticated data stream mining. Therefore, even though the PEs are initially associated with a given job, the PEs can be re-associated into one or more subjobs in order to facilitate the desired data stream mining. Thus, the PEs of a given job can be associated with the same subjob or with different subjobs and can run on either the same or different processing nodes 125 within the system. Preferably, the PEs of a given job are assembled into one or more subjobs associated with a single job translated from a single distributed plan derived from a given inquiry. In one embodiment, the job management layer 116 within each site is responsible for initiating and terminating jobs through the creation and initiation of the subjobs containing the PEs of the jobs. In one embodiment, each job management layer is in communication with an optimizing scheduler 121 that allocates nodes to PEs based on criteria including priority, inter-node connectivity and bandwidth requirements. As illustrated, the job management layer is responsible for the creation and initiation of subjobs on the various nodes. Alternatively, the job planner in the INQ layer includes the functionality to define subjobs and associate these subjobs with the appropriate nodes.
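
Priority-driven allocation of nodes to PEs can be sketched with a simple greedy pass. This is not the optimizing scheduler 121 itself: it considers only priority and a single capacity figure per node, ignores inter-node connectivity and bandwidth, and the function name `allocate_nodes` and the tuple layout are hypothetical.

```python
def allocate_nodes(pes, node_capacity):
    """Greedily place PEs on nodes, highest priority first.

    pes: list of (pe_name, priority, demand) tuples.
    node_capacity: dict mapping node name to free capacity units.
    Returns a dict mapping each placed PE to its node; PEs that do not fit
    on the least-loaded node are left unplaced.
    """
    free = dict(node_capacity)  # copy so the caller's dict is not modified
    placement = {}
    for name, _priority, demand in sorted(pes, key=lambda p: -p[1]):
        node = max(free, key=free.get)  # node with the most free capacity
        if free[node] >= demand:
            free[node] -= demand
            placement[name] = node
    return placement


placement = allocate_nodes(
    pes=[("pe_a", 1, 4), ("pe_b", 5, 6), ("pe_c", 3, 5)],
    node_capacity={"node_1": 8, "node_2": 6},
)
```

Here the lowest-priority PE is the one left without a node, which reflects the stated goal of supporting the highest priority activities when the system is overloaded.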

The system also includes a stream processing core (SPC) 118 that manages the distributed execution of the PEs contained within the subjobs. The SPC includes a data fabric 120 component and a storage 122 component. The data fabric component facilitates the transport of data streams between PEs and persistent storage, i.e., storage 122. Therefore, data can optionally be routed to storage as needed. A nano-scheduler provides adaptive connectivity and fine-grained scheduling of communicating subjobs. In one embodiment, the nano-scheduler is located within the scheduler 121. The scheduler 121 is a three-tier scheduler. The first tier is a macro scheduler running at longer time scales and deciding things such as which jobs to run. The second tier is a micro scheduler running at short time scales and dealing with changes in system state. The third tier is a nano scheduler running at the finest time scale and dealing with flow variations. The storage component uses value-based retention to automatically reclaim storage by deleting the least valuable data at any given time. Results flow back 124 from PEs to the UE layer for delivery to the requesting user.
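
Value-based retention, as described above, amounts to deleting stored objects in increasing order of value until enough space is reclaimed. The following sketch assumes each stored object carries a numeric retention value and a size in bytes; the function name `reclaim` and that representation are illustrative assumptions, not the storage component's actual interface.

```python
import heapq


def reclaim(objects, bytes_needed):
    """Pick objects to delete, least valuable first, until enough bytes are freed.

    objects: dict mapping object name to (retention_value, size_bytes).
    Returns the names selected for deletion, in deletion order.
    """
    # Min-heap keyed on retention value, so the least valuable object pops first.
    heap = [(value, name, size) for name, (value, size) in objects.items()]
    heapq.heapify(heap)
    deleted, freed = [], 0
    while heap and freed < bytes_needed:
        _value, name, size = heapq.heappop(heap)
        deleted.append(name)
        freed += size
    return deleted


stored = {"a": (0.9, 100), "b": (0.1, 50), "c": (0.5, 80)}
to_delete = reclaim(stored, bytes_needed=100)
```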

Each one of the plurality of sites within the cooperative data stream processing system runs an instance of the system architecture illustrated in FIG. 1. Therefore, as used herein, each site is a self-contained, fully functional instance of the cooperative data stream processing system of the present invention. In one embodiment, each site runs an instance of each component of the system architecture as described above in addition to a fault-tolerant service. In one embodiment, each site belongs to a distinct organization and has its own administrative domain, i.e., administrators who manage one site generally exercise no control over the other sites within the system. In this respect, the process of distributing cooperative data stream processing systems of the present invention among multiple sites is similar to Grid Computing. Cooperation among the plurality of sites is achieved by the sites negotiating peering relationships, for example offering resources to each other while retaining a desired level of local autonomy. In one embodiment, two or more sites within the cooperative data stream processing system that want to collaborate for a common goal and benefit negotiate and form one or more virtual organizations (VOs). The sites can be homogeneous, heterogeneous or combinations of heterogeneous and homogeneous sites.

Exemplary embodiments of cooperative data stream processing systems in accordance with the present invention are powerful processing systems capable of solving complex analysis problems. Cooperation among the plurality of distinct, distributed sites enhances the capabilities of the cooperative data stream processing system. With regard to the breadth of analysis provided by the cooperative data stream processing system, a single organization addresses a set of problems that require data analysis by processing only the relevant data that the single organization alone is able to access. However, when two organizations work in conjunction, a larger and more diverse set of data is available for analysis. This increase in the size of available data expands the range of problems that can be analyzed, improves the quality of the resulting output of the analysis and facilitates the addition of analysis types not available in a single organization. For example, a multinational financial services company might perform detailed acquisition and analysis of companies, economies and political situations within the local geographic region of each of its analysis sites. These various sites could interoperate minimally by default, but cooperate closely upon a significant event or when analysis of multinational organizations is required.

Cooperation enhances both reliability and scalability within the system. With regard to reliability, the reliability of one site is significantly improved through the use of agreements with other sites to take over key processing and storage tasks when failures occur. With regard to scalability, cooperation among sites provides increased scalability as extreme scalability cannot be achieved through unbounded growth of an individual site. The cooperation of multiple autonomous sites achieves much higher levels of scalability. In addition, cooperation across sites allows offloading of processing demands to other sites when one site experiences a workload surge.

Cooperative data stream processing systems in accordance with the present invention support a range of distribution or peering models, ranging from basic models to sophisticated models. In one embodiment, the system is arranged to support a range of different peering models between sites. Negotiated common interest policies (CIPs) define the relationships among sites, and thereby the formation of virtual organizations (VOs). Although each VO can be a distinct entity containing an exclusive grouping of sites, different VOs may overlap with one another, i.e. may contain the same sites. Therefore, any one of the plurality of sites can participate in multiple VOs. This structure facilitates basic point-to-point, i.e., site-to-site, peering and peering between entire VOs having sites arranged in hierarchical, centralized or decentralized arrangements. For simplicity, the distribution models discussed below are described in the context of basic point-to-point interaction between sites.

In one embodiment of a basic distribution model, all processing takes place at a home site, i.e., the site performing an inquiry and making use of resources from other sites. Data source sharing is achieved by directly shipping data from remote sites across the network for processing at the home site. Shared data sources include real-time data streams and stored data. Implementing this distribution model creates the necessity for distributed data acquisition capabilities to identify and to access remote data sources and a stream processing engine that can send and receive streams remotely. One advantage of the basic distribution model is simplicity. Data from another site is used with local processing, and the amount of processing and network bandwidth resources consumed is related to the volume of the data streams originating at remote sites. Larger volumes of transferred data, however, consume more resources. Primal streams in particular consume large amounts of resources in this distribution model as these streams undergo little to no processing at the remote site to reduce their size. Derived streams may be at a more manageable data rate, presenting less of an issue, but in some cases even a derived stream is voluminous.

In another embodiment of the distributed processing model, preliminary processing of a data source is conducted at the site from which the data source originates. This arrangement addresses the issue of sending large amounts of data across the network. In addition, duplicate processing is reduced when two or more sites want to access the same data source from a third site and need to perform the same or similar processing. This approach adds complexity, however. If a data source is not already being accessed on the remote site, then processing must be initiated there on behalf of the home site, raising issues of trust between the cooperating sites, as one site is asking the other site to execute potentially arbitrary code on its behalf. The trust issue is addressed using the CIP that exists between the sites. One aspect of a CIP reflects the arrangement each site has negotiated by specifying the data sources each site is willing to share and the types of processing each site is willing to perform on the shared data sources.

Other distribution models achieve more distributed processing. In one embodiment, effective ownership of some resources in the remote site is transferred to the home site. Therefore, the scheduler located at the home site allocates those resources or processing nodes for which ownership has been transferred to the home site. This model is referred to as resource partitioning and requires a relatively high level of cooperation and trust between the remote site and the home site. In another embodiment, processing is scheduled by the remote site and includes commitments regarding the allocated resources. In this embodiment, which is effectively a service-level agreement (SLA) model, a greater degree of site autonomy is maintained. In addition, this model facilitates sharing when multiple sites want to access the same data stream.

In another embodiment of the distributed planning model, the availability of both data sources and processing resources at multiple sites is considered as part of the planning process. For example, if the home site requires several data sources from a remote site, the most logical solution may be to send an entire job or subjob over to that remote site as opposed to communicating the data sources from the remote site to the home site. Similarly, a given set of PEs may be broken down and distributed among a plurality of sites according to the availability of data sources and the processing capability at each site. In order to partition a processing graph intelligently, the availability of data sources, PEs and processing resources at each site must be known. Therefore, the identification of other job components running at a specific site and how important these jobs are in comparison to the one being planned are taken into consideration. In addition, the execution of the distributed plan is monitored closely to ensure that each site involved is operating effectively and that the overall plan is executing as efficiently as possible across the sites. Execution issues discovered via monitoring feedback can trigger re-planning of the entire job or a portion of the job.
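
By way of illustration only, the trade-off between shipping data sources to the home site and shipping the job to a remote site can be sketched as a comparison of the data that would cross the network in each case. The function names, the source identifiers and the data rates below are hypothetical, and the sketch deliberately ignores processing capacity, competing jobs and CIP constraints, all of which a distributed planner would also weigh.

```python
def network_cost(site, sources_needed, location, rate):
    """Total data rate that must cross the network if the job runs at `site`."""
    return sum(rate[s] for s in sources_needed if location[s] != site)


def choose_execution_site(sites, sources_needed, location, rate):
    """Pick the candidate site that moves the least data (ties go to the first listed)."""
    return min(sites, key=lambda site: network_cost(site, sources_needed, location, rate))


# Hypothetical example: three of the four required sources live at the remote site.
location = {"s1": "remote", "s2": "remote", "s3": "remote", "s4": "home"}
rate = {"s1": 10, "s2": 20, "s3": 5, "s4": 8}  # arbitrary units
needed = ["s1", "s2", "s3", "s4"]

chosen = choose_execution_site(["home", "remote"], needed, location, rate)
```

In this example, running the job at the remote site moves only the single home-site source across the network, so the sketch selects the remote site.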

Preferably, a combined model approach to distributed planning is used. This combined model approach is more complex than the models described above; however, the combined model is the most powerful model. The combined model approach receives support from several components in the cooperative data stream processing system architecture including the INQ layer and the scheduler. A higher degree of interoperability and trust between sites is utilized by the combined model approach. This higher degree of trust can be based, for example, on the CIPs for the plurality of sites within the cooperative data stream processing system. In general, however, distributed planning is a central feature to system-wide or region-wide effectiveness and efficiency. Multiple sites that cooperate for the good of the entire system as a whole, rather than optimizing independently and in isolation, optimize the use of resources by optimizing the subdivision and placement of jobs according to their inputs, execution patterns and priorities, among other factors.

In one embodiment, an increased level of integration is provided by situating a given instance of the job management layer and scheduler to encompass multiple sites. Therefore, this instance of the job management layer and the scheduler optimize multiple sites concurrently, treating these sites as a whole. This increased level of integration requires the greatest amount of interoperability and trust between sites. Depending on the degree of integration, sites can be either cooperative, in which the sites work toward certain common goals but retain a significant amount of autonomy, or federated, in which sites subordinate to a single lead site. In one embodiment, the integration arrangement among the sites is expressed in the CIPs.

As was discussed above, when two or more sites located within the cooperative data stream processing system of the present invention agree to interoperate to achieve common or distinct goals that these sites are unable to achieve in isolation, the sites form a VO. An example of forming VOs in the context of grid computing is described in Ian Foster, Carl Kesselman and Steven Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, Lecture Notes in Computer Science, 2150 (2001). In forming a VO, the member sites agree, i.e. negotiate, on inter-operational terms. These negotiated terms are formulated into a CIP for that VO. As member sites of a given VO, each site shares various types of data and processing resources in accordance with the CIP.

In defining the interactions among the member sites, each site agrees to a predetermined style of interoperation for the VO, i.e. cooperative or federated. A federated VO includes an appointed lead site for the VO. The lead site assumes a coordination role and is able to exert a level of control over the other sites. Federated VOs function best when the member sites share a common set of goals. The lead site is able to optimize resource and processing usage to support the common good of the VO or at least the good of the lead site. A cooperative VO lacks a central point of authority. The VO members interact as peers. Each member site is independent of the other sites and may have a separate agenda. However, the member sites recognize that operating in a cooperative manner increases the overall fulfillment of each independent goal.

In general for all VOs, the CIP includes the terms and conditions governing the interoperability among the plurality of member sites of the VO. In one embodiment, the CIP identifies the data such as data streams and locally stored data that are shareable via remote access. This identification includes identifying classes of data streams and other data based on their attributes, since it may not be possible at the time the CIP is created to predict the data streams and other data that will exist in the future. A given CIP references the classes within the terms for that CIP. For example, a given data stream is tagged globally public, locally public or private, and a CIP term is created that grants read access for all globally public streams. As another example, a data stream is tagged as coming from a publicly accessible sensor, e.g., a traffic camera, and the CIP contains a term that states that public sensors are freely shared. In one embodiment, a CIP term is general and specifies that any data source located in a particular location, e.g., city, is shared, without such explicit tagging.
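
The attribute-based terms described above can be illustrated with a minimal sketch. The names DataStream, CipTerm and may_read, the attribute keys and the example streams are hypothetical and are not part of the described system. The sketch assumes that a term grants an action on every stream whose attributes include a set of required values, and that access not granted by any term is denied.

```python
from dataclasses import dataclass, field


@dataclass
class DataStream:
    """A data stream described only by its attributes (hypothetical model)."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class CipTerm:
    """Grants `action` on every stream whose attributes include `required`."""
    action: str
    required: dict

    def matches(self, stream, action):
        return action == self.action and all(
            stream.attributes.get(key) == value
            for key, value in self.required.items()
        )


def may_read(terms, stream):
    """Assumed default-deny: allowed only if some term explicitly grants read."""
    return any(term.matches(stream, "read") for term in terms)


terms = [
    CipTerm("read", {"visibility": "globally public"}),
    CipTerm("read", {"sensor": "public"}),   # public sensors are freely shared
    CipTerm("read", {"city": "NY"}),         # location-based term, no tag needed
]

traffic_cam = DataStream("cam-17", {"visibility": "private", "sensor": "public"})
ledger = DataStream("ledger", {"visibility": "private", "city": "Boston"})
```

Because terms refer to classes of streams rather than to individual streams, a stream created after the CIP was negotiated is covered as soon as it carries matching attributes.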

The CIP also includes terms to identify resources such as processing resources that are sharable. These terms identify member sites that support remote inquiries and, therefore, support the distributed planning interaction model. In addition, these terms identify member sites that only support the distributed processing and distributed data source interaction model. In one embodiment, the CIP terms identify the types of raw processing resources that are available to be shared. Suitable processing resources include, but are not limited to, central processing unit (CPU), memory, storage, software and hardware including special processing hardware. The types of available raw processing resources identify the VO as supporting the resource partitioning model, the SLA-based model or both models. The CIP terms can also identify the member sites that are available to assist in failure recovery processes and the degree of assistance available from each one of these member sites. Additional resources that can be shared include, but are not limited to, capabilities including the capability to monitor sites and to recover failed jobs. Such capabilities use CPU and memory resources but are different than the actual raw resources.

The processing resources within the VO can be offered to all member sites of the VO. Alternatively, the processing resources are offered to only a subset of the member sites, as specified in the terms of the CIP. In one embodiment, anything that is not explicitly offered in a CIP is not allowed. By specifying these terms in the CIP, each VO member site is advertising resources that another VO member site may request to use. However, the ability of other member sites to actually use these resources is not guaranteed. Some resources are limited in nature, and, therefore, the site providing these limited resources may not be able to satisfy all requests from all consumer sites simultaneously, at least not with the quality of service that the consumer sites expect. Therefore, in order for a VO member site to reserve an exclusive use of the limited resource, this member site establishes an agreement with the providing member site. This agreement is used in both the SLA and resource partitioning models described previously.

In addition to defining the set of agreements that are possible in a VO, the CIP specifies the particulars that are available for an agreement, for example the quality of service levels, costs and limitations on the resource usage. Once established, a given resource agreement is referenced every time a request is made for that resource. The terms and conditions of the agreement, in addition to the costs and penalties, are continuously monitored by auditing functions located at both sites that are members to the agreement, i.e., the sites providing and consuming the resource.

In the cooperative data stream processing system of the present invention, the CIPs provide the creation templates that are used to create agreements between the provider and the consumer of the resource to be shared. These templates are used to create an actual agreement to access particular resources over a specified time interval. In addition, the CIPs define higher-level business interaction schemes between VO member sites. For example, the stakeholders of a given site can specify in the CIP not only the types of possible interactions between the VO member sites, but also the conditions under which agreements can or cannot be established. CIP terms can be made within a VO-wide context and not just in the context of two member sites. In addition to describing the interoperation terms between member sites regarding resource sharing, the CIP also contains the technical communication details that are necessary to establish the communication channels among the various member sites. In one embodiment, the member sites that are members of a given VO are heterogeneous, for example having different data formats and security labels. To overcome issues related to handling heterogeneous systems, the CIP contains information regarding the kind of environment mapping required in order for the various types of sites within the VO to communicate.

Each site within the cooperative data stream processing system is not limited to being a member of only one VO. A given site can be a member site in a plurality of different VOs, both federated and cooperative. However, although member sites of a given VO interact and cooperate, member sites of different VOs are not allowed to interact directly with each other. If a given site attempts to use resources from multiple VOs, that site must interact separately and process data with each VO, then merge the results locally for potentially further processing and present the merged final results to the user, subject to the constraints in the multiple VOs' CIP terms as agreed.

In one embodiment, a given VO can join as a member of another VO, forming a hierarchical VO structure. The joining VO honors any interoperation terms that are expressed in the CIP of the VO to which it joins. The joining VO uses the resources of its member sites to fulfill requests in accordance with the interoperation terms. How the member sites of the joining VO are used depends upon the type of VO. For a federated VO, the VO lead site delegates requests to the joining VO member sites as the lead site determines is appropriate. A cooperative VO that joins as a member of a larger VO requires extensive negotiation to specify in the CIP how the member sites of the cooperative VO can be used.

Referring to FIG. 2, an exemplary embodiment of a complex VO structure 200 in accordance with the present invention is illustrated. As illustrated, triangles represent federated VOs, and ovals represent cooperative VOs. Individual member sites are represented as circles, and federated lead sites are squares. The structure includes a plurality of VOs 202, and each VO contains a plurality of member sites 204. One of the plurality of VOs is an isolated federated VO 206 (FVO#1), and one of the plurality of VOs is an isolated cooperative VO 208 (CVO#1). Since the member sites in these VOs are not members of any other VOs, the only sites they are able to interact with are the other members of that same VO. For example, site A is only able to interoperate with sites B, C, and D. A second federated VO 210 (FVO#2) contains three member sites, lead site I and participant sites J and K. In addition, the second federated VO 210 includes a member that is itself a cooperative VO 212 (CVO#2). One of the member sites 214 (K) is also a member site of a cooperative VO 216 (CVO#3). This cooperative VO also includes three other member sites. Another federated VO 218 (FVO#3) is provided having four member sites, and the lead member site 230 (R) is also a member site of one of the cooperative VOs 216.

These mixed and overlapping hierarchical VO structures allow very complex structures to be created. Care is taken in constructing these structures to avoid creating operational issues. For example, the second cooperative VO 212, while organized as a cooperative VO, is joined to a federated VO 210. Therefore, the member sites of the joining cooperative VO agree to some degree to a higher level of control from the lead member site of the federated VO. Therefore, when a cooperative VO joins another VO, either cooperative or federated, all member sites are involved in the decision as the decision affects all the member sites. If a federated VO joins another VO as a member, only the lead site is needed for the decision, because the lead site exerts authority on its members. In general, joining a cooperative VO causes less impact on the joining member sites, because the joining members retain a high degree of individual control. When a federated VO lead site 220 joins a cooperative VO 216, that lead site maintains a high degree of flexibility in delegating work to member sites in the federated VO, since the lead site retains control over the members of the federated VO. This ability of a lead site to delegate or off-load responsibilities enables the lead site to re-mission its resources to better fulfill any requests imposed on it due to its membership in the cooperative VO. Because the member sites (S, T, U) in the federated VO (FVO#3) are not in the cooperative VO (CVO#3) like the federated VO leader site, these sites are not able to interact directly with the other members of the cooperative VO (CVO#3) and must interact via the federated VO leader site (R).
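
The interaction rule for FVO#3 and CVO#3 can be sketched as a membership check, assuming that two sites may interact directly only when they share at least one VO. The data layout and function names below are hypothetical. Sites K, R, S, T and U follow the description of FIG. 2, while "X" and "Y" are placeholder names for CVO#3 member sites that the description does not name.

```python
# Hypothetical membership table; "X" and "Y" are placeholders for unnamed sites.
VOS = {
    "FVO#3": {"R", "S", "T", "U"},
    "CVO#3": {"K", "R", "X", "Y"},
}
LEADS = {"FVO#3": "R"}  # lead site of each federated VO


def shared_vos(a, b):
    """Names of the VOs in which both sites are members."""
    return sorted(vo for vo, members in VOS.items() if a in members and b in members)


def can_interact_directly(a, b):
    """Assumed rule: direct interaction requires at least one common VO."""
    return bool(shared_vos(a, b))


def relay_for(a, b):
    """Lead site of one of a's federated VOs that shares a VO with b, or None."""
    for vo, lead in LEADS.items():
        if a in VOS[vo] and lead != a and can_interact_directly(lead, b):
            return lead
    return None
```

Under these assumptions, site S cannot reach site K directly, and relay_for("S", "K") identifies the lead site R as the intermediary.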

Although sites and VOs may be members of multiple VOs simultaneously, they are not allowed to join a VO if this would cause a conflict with their existing peering relationships. For example, if a site is a member of a VO that requires it to share a given resource with a second site, that site is not allowed to join another VO that prohibits the sharing of this same resource with the same site, unless that site withdraws from the first VO. In one embodiment, a given site can choose which terms it wants to adhere to and which VO it wants to join.
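
A minimal sketch of such a conflict check follows. It assumes, purely for illustration, that the sharing obligations of a site are represented as a mapping from a (resource, peer site) pair to either "require" or "prohibit". The representation and the names find_conflicts and may_join are hypothetical.

```python
def find_conflicts(current, candidate):
    """(resource, peer) pairs that one set of terms requires and the other prohibits."""
    return sorted(
        key
        for key, rule in candidate.items()
        if key in current and {rule, current[key]} == {"require", "prohibit"}
    )


def may_join(current, candidate):
    """A site may join only if the candidate VO's terms conflict with none it holds."""
    return not find_conflicts(current, candidate)


# Terms the site already honors through its first VO.
current_terms = {("camera-feed", "site-B"): "require"}

# A second VO that prohibits sharing the same resource with the same site.
conflicting_vo = {("camera-feed", "site-B"): "prohibit"}

# A second VO whose prohibition concerns a different peer site.
compatible_vo = {("camera-feed", "site-C"): "prohibit"}
```

If the site withdraws from the first VO, the corresponding entries are removed from its current terms and the check can be repeated.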

As used herein, resource awareness refers to the discovery and retrieval of information about data sources, PEs and other kinds of resources, for example execution resources and active inquiries, among multiple collaborating sites. Each site stores information about such resources in relational or semantic data stores. In one embodiment, the instance of the data source management component on each site maintains low-level characteristics, e.g., delays and data rates, about data sources in a relational database and semantic descriptions in a semantic metadata store. The component that provides the discovery and retrieval of information about remote resources is the resource awareness engine. The resource awareness engine is in communication with the other components on a given site and is used by these components to retrieve desired information. For example, if a distributed job planner needs to know the kinds of data sources and PEs that are available at remote sites in order to produce global plans that utilize resources in a VO, the distributed job planner uses the resource awareness engine to access such information about other sites. The same applies to PEs and other kinds of resources as well.

The resource awareness engine provides a layer of indirection between endpoints. For example, a store or a client does not need to interact with the other end directly. The ability to eliminate the need for interaction between endpoints is particularly beneficial when there are many endpoints. The resource awareness engine provides a universal interface that endpoints use to communicate, and the resource awareness engine conceals underlying complexities and dynamics so that the endpoints always see the same interface. The addition or withdrawal of any site is handled by the resource awareness engine and becomes transparent to each client.

The resource awareness engine provides two kinds of interfaces. The first interface is a search interface, which is the “pull” mode of resource discovery. A client sends a query to the resource awareness engine, specifying the resources that are requested. The resource awareness engine searches and returns matching resources from multiple remote sites. The second interface is a publish/subscribe interface, which is the “push” mode of operation. Sites having resources to advertise and share with other sites publish the information to the resource awareness engine. Sites requiring resources subscribe to the resource awareness engine and specify the resources needed. The resource awareness engine actively pushes matching resources to the requesting sites. These two interfaces fulfill different needs within the system. The “pull” mode interface is suitable for clients, for example the failover site selection component, that request dynamically changing resources once in a while, only upon infrequent events, e.g., site failures, and require only the most up-to-date information. The “push” mode interface is suitable for clients, for example the Data Source Manager, that want to keep updated about continuously changing information, not just current but also past information. This interface keeps the client up to date about variations. A client may use a combination of “pull” and “push” interfaces for different types of resources as well.

Two different engine components in the resource awareness engine interact with system endpoints. These components are the exporter component of the resource awareness engine and the importer component of the resource awareness engine. The exporter component is responsible for interacting with a resource store that has data to publish or that is willing to accept external queries. The exporter component receives resources advertised by the store and relays these resources to the importer component. Alternatively, the exporter component receives queries from importer components, forwards these queries to the resource store and returns results. The importer component interacts with sites that request resources. The importer component receives queries from the sites and relays these queries to the exporter component. Alternatively, the importer component accepts subscriptions from sites and actively pushes matching resources back. In one example of data source discovery using the resource awareness engine, an existing single site component manages resource stores. When a client, for example a distributed planner, needs to discover remote data sources, the client sends a query to its local importer component. The importer component checks the CIP to identify sites that it can search. The importer component forwards the query to the exporter component of the identified sites. The exporter component checks the CIP to ensure the requesting site is allowed to access the resources. If so, the exporter component forwards the query to the Data Source Manager (DSM) component, which returns the results. Eventually the matching data source records are returned to the client.
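
The pull-mode query path described above can be sketched as follows. The class names, the record layout and the use of simple in-memory lists in place of the CIP and the DSM store are assumptions made only for illustration. The sketch shows the two policy checks, one at the importer and one at the exporter, that bracket each query.

```python
class Exporter:
    """Exporter-side stand-in: answers queries only for requesters its CIP allows."""

    def __init__(self, site, records, allowed_requesters):
        self.site = site
        self.records = records              # stands in for the site's DSM store
        self.allowed = allowed_requesters   # stands in for the exporter-side CIP check

    def query(self, requester, predicate):
        if requester not in self.allowed:
            return []                       # requester not permitted by the CIP
        return [r for r in self.records if predicate(r)]


class Importer:
    """Importer-side stand-in: forwards a client's query to searchable sites."""

    def __init__(self, site, searchable):
        self.site = site
        self.searchable = searchable        # exporters the CIP lets this site search

    def search(self, predicate):
        results = []
        for exporter in self.searchable:
            results.extend(exporter.query(self.site, predicate))
        return results


site_b = Exporter("B", [{"id": "cam1", "type": "video"},
                        {"id": "mic1", "type": "audio"}], allowed_requesters={"A"})
site_c = Exporter("C", [{"id": "cam9", "type": "video"}], allowed_requesters={"D"})

importer_a = Importer("A", [site_b, site_c])
found = importer_a.search(lambda record: record["type"] == "video")
```

Here site C holds a matching record but does not list site A as an allowed requester, so only the record from site B is returned to the client.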

Remote data sources can also be located using the push mode of operation of the resource awareness engine. For example, remote sites actively publish information about data sources through their local exporter components. The distributed job planner, or the DSM component that acts on behalf of the job planner, sends a subscription to its importer component. The importer component notifies other exporter components. Whenever matching data sources are published, exporter components actively push the matching data sources to the importer component and eventually to the client.

In one embodiment, the resource awareness engine provides the “pull” mode resource discovery by organizing the resource awareness engine components located on multiple sites into an overall hierarchy. The resource awareness engine component of each site chooses the resource awareness engine of another site as its parent. Multiple such sites can collectively form a tree structure. The hierarchy of the tree structure can naturally follow existing administrative relationships within an organization that owns multiple sites. This hierarchy can be used in a federated VO. Organizational peers, which are not subordinate to each other, negotiate among themselves and determine the hierarchy formation. Thus, this hierarchy formation can also be done in a cooperative VO. The exporter component at each site summarizes its resources, e.g., data sources, in aggregated forms and sends the summary to the importer component of its parent site. The aggregate resource summary is a condensed representation of the original resources, e.g., data source records, and supports attribute-based searching. The aggregate resource summary can take many different forms. For example, a histogram form can be used to summarize the DATA-RATE attributes of the video data sources of a site. Multi-resolution compression techniques can be used as well. A description of multi-resolution compression techniques is given in Deepak Ganesan et al., Multi-resolution Storage and Search in Sensor Networks, ACM Transactions on Storage, August 2005. The importer component of a parent site further aggregates the summaries from its children sites and sends these summaries up the hierarchy. Therefore, summaries are aggregated and propagated bottom-up through the hierarchy. The root of the hierarchy has a global summary of all the resources within the hierarchy, and each site has a branch summary of resources owned by its descendants.
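
As one possible illustration of the histogram form, the following sketch summarizes DATA-RATE values into fixed-width buckets and merges child summaries into a parent summary. The bucket width, the function names and the sample rates are assumptions chosen for the example and are not specified by the system described here.

```python
from collections import Counter

BUCKET_WIDTH = 10  # assumed bucket width for DATA-RATE, in arbitrary units


def summarize(data_rates):
    """Histogram of data rates: bucket lower bound -> number of data sources."""
    return Counter((rate // BUCKET_WIDTH) * BUCKET_WIDTH for rate in data_rates)


def aggregate(own, child_summaries):
    """Merge a site's own summary with the summaries received from its children."""
    total = Counter(own)
    for child in child_summaries:
        total.update(child)
    return total


def may_contain(summary, low, high):
    """True if some non-empty bucket overlaps [low, high]; false positives are possible."""
    return any(
        bucket <= high and bucket + BUCKET_WIDTH > low
        for bucket, count in summary.items()
        if count > 0
    )


leaf_1 = summarize([5, 12, 18])
leaf_2 = summarize([47])
root = aggregate(summarize([33]), [leaf_1, leaf_2])
```

The summary is condensed, so a positive answer from may_contain only indicates that a branch is worth searching. The individual records still have to be examined at the site that owns them.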

The discovery of data sources initiates in the root resource awareness engine. An importer component from a client site sends a request to the importer component of a root site. The root site examines its own resources and the summaries of the resources of its children. The root site returns its eligible resources to the client and instructs the client to search the child branches of the root site that contain matching summaries. Through this mechanism, the client discovers eligible resources from all sites. In one embodiment, replication overlays are used to eliminate potential performance and failure bottlenecks at the root importer component.
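
The top-down search can be sketched as a pruned tree traversal. In this hypothetical example the branch summary is simply the minimum and maximum data rate found in a branch, and it is recomputed on demand for brevity, whereas in the arrangement described above the summaries would already have been propagated up the hierarchy. The class and function names are assumptions.

```python
class Site:
    def __init__(self, name, rates, children=()):
        self.name = name
        self.rates = list(rates)        # DATA-RATE of each local data source
        self.children = list(children)


def branch_range(site):
    """Condensed branch summary: (min, max) rate over the site and its descendants."""
    ranges = [branch_range(child) for child in site.children]
    values = site.rates + [v for rng in ranges if rng for v in rng]
    return (min(values), max(values)) if values else None


def search(site, low, high, visited):
    """Collect (site, rate) matches, descending only into branches whose summary overlaps."""
    visited.append(site.name)
    hits = [(site.name, rate) for rate in site.rates if low <= rate <= high]
    for child in site.children:
        rng = branch_range(child)
        if rng and rng[0] <= high and rng[1] >= low:
            hits.extend(search(child, low, high, visited))
    return hits


site_a = Site("A", [5, 12])
site_b = Site("B", [47])
root = Site("root", [33], [site_a, site_b])

visited = []
hits = search(root, 40, 50, visited)
```

In this example the branch rooted at site A is never contacted because its summary does not overlap the requested range.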

The “push” mode of the resource awareness engine uses a semantic pub/sub system that matches events to subscriptions. Events are the semantic description of advertised resources in resource description framework (RDF) triples. Each triple has a subject, a predicate and an object and describes the relation between the subject and object. For example, Camera51 locatedIn NY indicates “Camera51” is located in “NY”. A set of these triples can represent the semantic information of resources such as data sources. Subscriptions are RDF triple patterns. The RDF triple patterns are similar to triples, but some elements can be variables. For example, ?x locatedIn NY represents any subject “?x” that is located in “NY”. The semantic matcher receives events for resources from exporter components and subscriptions from importer components. The semantic matcher uses a semantic reasoner to deduce facts from ontologies, which contain formal representations of domain knowledge such as the location relationship of all cities and states in the U.S., and decides which events match to which subscriptions. An example of a semantic reasoner is described in J. Zhou, L. Ma, Q. Liu, L. Zhang, and Y. Yu, Minerva: A Scalable OWL Ontology Storage and Inference System, The First Asian Semantic Web Symposium (2004).
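
The matching of an event triple against a triple pattern can be illustrated with a short sketch. Triples are represented here as plain tuples of strings, variables are the elements that begin with "?", and the subscriber name is hypothetical. No ontology reasoning is performed, so the sketch covers only the syntactic part of the matching that the semantic matcher would carry out.

```python
def match(pattern, triple):
    """Return the variable bindings if `pattern` matches `triple`, else None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must bind to the same value everywhere it appears.
            if bindings.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return bindings


# Hypothetical subscription table: subscriber name -> triple pattern.
subscriptions = {"planner": ("?x", "locatedIn", "NY")}


def subscribers_for(event):
    """Names of subscribers whose pattern matches the published event triple."""
    return sorted(
        name for name, pattern in subscriptions.items()
        if match(pattern, event) is not None
    )


event = ("Camera51", "locatedIn", "NY")
```

A reasoner would extend this by deriving additional triples, for example that a resource located in a city is also located in the state containing that city, before the patterns are evaluated.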

Failures can occur within exemplary cooperative data stream processing systems of the present invention in a variety of ways. Individual PEs or subjobs can fail. Various system components, both hardware, e.g., storage and computation nodes, and software, e.g., INQ, DSM, can also fail. The failure of components at a minimum causes the degradation of the capabilities of the site and at worst causes the failure of the entire site. Even partial failures of components can dramatically impact the capacity of a site.

Failure recovery is important both within a site and between sites. Given the ability to recover across sites, say from a checkpoint, the technology to recover within the same site also exists. Therefore, the emphasis is on cross-site or inter-site failure recovery, and the existence of certain intra-site failure recovery functionality is assumed when needed. Multi-site failure recovery requires mechanisms for supporting recovery and policies governing issues such as site selection and frequency of checkpoints.

Support of failover depends on the types of subjobs being executed. Many non-critical subjobs can be terminated under appropriate circumstances. These subjobs need no special support for recovery when the subjob or the nodes on which the subjobs run fail. Subjobs that are more important, yet not critical, can be restarted from scratch upon a failure without significant loss to users. A relatively small but critical fraction, however, should be resumed after a failure without loss of state. For these, failure recovery techniques are required. Suitable failure recovery techniques are known and available in the art and include process-pairs, for example as described in Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann (1992), and checkpointing, for example as described in Theo Haerder and Andreas Reuter, Principles of Transaction-Oriented Database Recovery, Readings in Database Systems (2nd ed.), pages 227-242, Morgan Kaufmann Publishers Inc., San Francisco, Calif., USA (1994). These techniques work well for recovering within a site. In addition, these techniques can be used to run critical subjobs on another site, either in parallel (process-pairs) or upon a failure (checkpointing). However, the overhead of maintaining the state across multiple sites will be substantially higher than within a more tightly-coupled site.

To handle failures of hardware system components, two mechanisms are available. The first mechanism is load shedding and rebalancing within one site. After a failure of some nodes, low-priority jobs can be killed or suspended to make room for high-priority ones. High-priority jobs can also be redistributed among the remaining nodes, thus rebalancing the workload on the functioning nodes. The second mechanism is inter-site offloading. If the workload of important jobs in a site exceeds the capacity of the remaining nodes, the site can shift some of its high-priority jobs to other sites. In one embodiment, the sites pre-arrange CIPs among them to determine which jobs to offload and how to offload these jobs. Executing in another site faces heterogeneity in available data sources, execution environments, competing execution priorities and other issues. Therefore, executing jobs on alternative sites preferably is used as a last resort. In rare instances, an entire site may fail as the result of a natural disaster such as floods or earthquakes or the simultaneous failure of each instance of a critical system component. The primary difference between partial and total site failure is that in the former case, the affected site can initiate recovery actions, while in the latter case, another site must detect and respond to the failure. The choice of which site (or sites) backs up a given site is negotiated in advance, based on the CIP(s). Critical data, such as the state necessary to run specific subjobs and the stored data upon which those subjobs rely, are copied to the backup site(s) in advance. Any subjobs that are critical enough to be checkpointed periodically or run in parallel via process-pairs are coordinated across the sites.
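
The interplay of the two mechanisms can be sketched with a simple greedy policy. The policy itself, the function name, the per-node capacity and the priority threshold are assumptions made for illustration. Jobs are considered in descending priority and placed on the least-loaded surviving node, a job that no longer fits is offloaded to another site if its priority reaches the threshold, and any other job that does not fit is suspended. At least one surviving node is assumed.

```python
def recover(jobs, surviving_nodes, capacity, critical_priority):
    """jobs: list of (name, priority, demand). Returns (placement, suspended, offloaded)."""
    load = {node: 0 for node in surviving_nodes}
    placement, suspended, offloaded = {}, [], []
    # Highest priority first; sorted() is stable, so ties keep their input order.
    for name, priority, demand in sorted(jobs, key=lambda job: -job[1]):
        node = min(load, key=load.get)      # least-loaded surviving node
        if load[node] + demand <= capacity:
            load[node] += demand            # rebalancing within the site
            placement[name] = node
        elif priority >= critical_priority:
            offloaded.append(name)          # inter-site offloading, last resort
        else:
            suspended.append(name)          # load shedding
    return placement, suspended, offloaded


jobs = [("fraud", 9, 6), ("index", 8, 5), ("audit", 8, 5), ("report", 2, 4)]
placement, suspended, offloaded = recover(
    jobs, surviving_nodes=["n1", "n2"], capacity=8, critical_priority=8
)
```

With two surviving nodes of capacity 8, the two highest-priority jobs stay on the site, the remaining high-priority job is offloaded and the low-priority job is suspended.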

The CIPs between sites provide for significant flexibility in deciding how to respond to failures. A plurality of factors is considered in making this decision regarding how to respond to failures within the system. One factor looks at which site or sites should back up a given site. Some sites are excluded from serving in a back-up capacity due to either unwillingness or incompatibility. If multiple sites are available as satisfactory backups, a subset of these potential sites is identified. In one embodiment, site reliability and associated costs are taken into consideration when identifying the subset. The jobs or work associated with the failed site are divided among the sites in the identified subset. In addition, a determination is made regarding whether the assignment of backup sites is optimized by each site individually or decided for the benefit of a group of sites as a whole. The assignment of jobs will be handled differently in a federated VO versus a cooperative VO. Failure recovery or failure tolerance can also be provided through checkpointing. For a given subjob, a determination is made about how often and under what conditions checkpoints should take place. In one embodiment, the current state is checkpointed more frequently to support intra-site recovery than for inter-site recovery, as checkpointing for inter-site recovery entails higher overhead costs. The decision regarding how often and how much back-up data to store weighs the need for a sufficient amount of reliable data against the storage limitations of each site and the ongoing storage needs of each site. For replicated persistent data, value-based retention interacts with the reliability of the data as described in Ranjita Bhagwan et al., Time-Varying Management of Data Storage, First Workshop on Hot Topics in System Dependability, June 2005. In addition, each extra copy of backed-up data takes space away from a site's own data, some of which may have only one copy.
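The backup-site factors described above may be illustrated, under stated assumptions, by the following Python sketch. The `Candidate` fields, the reliability and cost scales, the ranking rule (most reliable first, cheaper first on ties) and the round-robin division of jobs are all hypothetical; the description does not prescribe a particular selection or division policy.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Candidate:
    site: str
    willing: bool       # site agrees to act as a backup
    compatible: bool    # site can run the failed site's jobs
    reliability: float  # assumed scale 0..1, higher is better
    cost: float         # assumed arbitrary cost units


def choose_backups(candidates, k):
    """Exclude unwilling or incompatible sites, then pick the k best."""
    eligible = [c for c in candidates if c.willing and c.compatible]
    eligible.sort(key=lambda c: (-c.reliability, c.cost))
    return [c.site for c in eligible[:k]]


def divide_jobs(jobs, backup_sites):
    """Divide the failed site's jobs among the chosen subset, round-robin."""
    assignment = {site: [] for site in backup_sites}
    for job, site in zip(jobs, cycle(backup_sites)):
        assignment[site].append(job)
    return assignment


candidates = [
    Candidate("A", True, True, 0.99, 5),
    Candidate("B", False, True, 0.999, 1),  # unwilling: excluded
    Candidate("C", True, True, 0.95, 2),
    Candidate("D", True, False, 0.99, 1),   # incompatible: excluded
    Candidate("E", True, True, 0.99, 3),
]
backups = choose_backups(candidates, k=2)
print(backups, divide_jobs(["j1", "j2", "j3"], backups))
```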

Exemplary embodiments of the cooperative data stream processing system in accordance with the present invention manage the inherent heterogeneity of the multiple collaborating sites. Each site can have a different operating environment, in terms of the runtime environment, system type, security and privacy policy set and user namespace, among other aspects. These points of differentiation are managed to allow the sites to interoperate.

Each site within the cooperative data stream processing system has its own runtime environment, including PEs, stored data, and type system, with potentially different names, formats, functions or interpretation. For example, a first site uses a 5-character string for the zip code, and a second site uses a full 9-digit zip code. In addition, a third site might not use the zip code at all. The present invention utilizes transformation and mapping rules as well as routines between sites to ensure that collaborative subjobs use PEs, stored data and types correctly across sites. In addition to inter-site variability in the representation and formatting of data, PEs, stored data and type systems evolve over time. The version of a given data set can differ from one site to another. Since subjobs using different versions of the same PE, stored data, or data types can co-exist, an evolution history is required. Suitable evolution histories use mechanisms such as versioning. The transformation and mapping should also handle such evolutions, both intra-site and inter-site.
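The zip code example above may be illustrated by a minimal Python sketch of one transformation rule. The function name `to_zip5` and the treatment of a missing value are assumptions made for illustration only; the sketch relies solely on the fact that the first five digits of a 9-digit zip code form the 5-digit zip code.

```python
import re
from typing import Optional


def to_zip5(value: Optional[str]) -> Optional[str]:
    """Map a site-specific zip code representation to a 5-character string.

    None models a site (such as the third site above) that carries no zip
    code at all; it is passed through unchanged.
    """
    if value is None:
        return None
    digits = re.sub(r"\D", "", value)  # drop hyphens and spaces
    if len(digits) in (5, 9):
        return digits[:5]
    raise ValueError(f"unrecognized zip code format: {value!r}")


print(to_zip5("10598-1234"), to_zip5("10598"), to_zip5(None))
```

The reverse mapping, from 5 digits to 9 digits, cannot be derived from the value alone, which is one reason a mapping rule between two sites is generally directional.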

Another source of heterogeneity among the sites is the security and privacy policies of each site. Collaborating sites can have identical or different security and privacy policies. When a single organization operates many sites, or all sites have high degrees of mutual trust and uniformity, a single security and privacy policy can be adopted under a common user namespace. The cooperative data stream processing system assumes either lattice-based secrecy, as described in Ravi Sandhu, Lattice-Based Access Control Models, IEEE Computer, November 1993, or integrity policy models, as described in IBM, Security in System S, http://domino.research.ibm.com/comm/research_projects.nsf/pages/system_s_security.index.html (2006). In one embodiment, each site within the system is provided with an understanding of the format and implied relationships of the security labels used by all sites within the system. The access rights and restrictions encoded within a security label are uniformly applicable throughout all the sites.

When multiple sites belonging to different organizations collaborate, however, uniform policies may not be feasible. In one embodiment, each site within the system defines its own security and privacy policies. All sites define secrecy levels and confidentiality categories for their subjects and objects; however, the numbers of secrecy levels, sets of categories and their meaning and interpretation vary from site to site. The user namespace also varies and can be completely separate from one site to another. In order to account for variations in security and privacy policies, policy translation and mapping are used. For example, in a collaborative hurricane response and recovery system, a given private organization uses two secrecy levels, public and organization-confidential, and no categories. A governmental agency, for example the Federal Emergency Management Agency (FEMA), dealing with the same situation uses four secrecy levels (unclassified, confidential, secret and top-secret) and a large set of categories, including a category Organization-NDA assigned to subjects to deal with organization-confidential information. The policy translation and mapping rules define that organization sites provide organization-confidential data only to agency subjects cleared to at least the confidential level and having the category Organization-NDA.
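The hurricane response mapping rule above may be expressed, purely as an illustrative sketch, in Python. The function name and data representation are hypothetical. The rule that public organization data is readable by any agency subject is an assumption not stated in the description; only the rule for organization-confidential data is taken from the example.

```python
# Agency secrecy levels in increasing order, as listed in the example above.
AGENCY_LEVELS = ["unclassified", "confidential", "secret", "top-secret"]


def agency_subject_may_read(org_label, subject_level, subject_categories):
    """Decide whether an agency subject may receive organization data."""
    if org_label == "public":
        return True  # assumption: public data needs no clearance
    if org_label == "organization-confidential":
        cleared = (AGENCY_LEVELS.index(subject_level)
                   >= AGENCY_LEVELS.index("confidential"))
        return cleared and "Organization-NDA" in subject_categories
    raise ValueError(f"unknown organization label: {org_label!r}")


print(agency_subject_may_read("organization-confidential",
                              "secret", {"Organization-NDA"}))
print(agency_subject_may_read("organization-confidential",
                              "top-secret", set()))
```

The second call illustrates that a high secrecy level alone is not sufficient without the Organization-NDA category.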

An architecture was described above for the individual components supporting cooperation in the cooperative data stream processing system. Referring to FIG. 3, an exemplary embodiment of the functions that facilitate cooperation in combination with the system architecture 300 is illustrated. The plurality of functions 320 supporting cooperation are aligned with the architectural components to which each function relates. In one embodiment, each site runs an instance of each component of the architecture and employs the set of functions as illustrated.

A first function is VO management 322, which is utilized by the user experience component 310. VO management has the greatest degree of direct interaction with end users, for example site administrators. Included within VO management are CIP management for activating, deactivating and maintaining CIPs, VO membership management for tracking which sites are in a VO and the roles of each site within a given VO, agreement management for enacting agreements with the other sites and VO services including accounting and SLA monitoring. Administrators for each site and each VO interact directly with VO management to create and update CIPs.

The plurality of functions also includes a VO planner 324 that works with the INQ component 312 to facilitate inter-site planning. The VO resource awareness engine (RAE) provides information about available resources and interacts with DSM 314 as well as the INQ component 312. The remote execution coordinator (REC) 328 extends the JMN layer 316 to the multi-site case by supporting distributed jobs. The tunneling function 330 extends the data fabric component of the stream processing core (SPC) 318 across sites by transmitting data from a PE on one site to a PE on another. In addition to functions that integrate with one of the layers in the system architecture, the plurality of functions 320 also includes functions that interact with multiple components in the core, i.e., single-site, architecture. The VO failover management (FM) 332 handles backup site arrangements, checkpointing and recovery after failure. In addition, VO FM 332 incorporates heartbeat management (not shown) for tracking the availability of sites. The VO heterogeneity management (HM) 334 function manages the mapping and translation for types, schemas, ontologies and security and privacy labels, among others.

The components and associated functions illustrated in FIG. 3 are replicated on each site within the system. In addition, the various components can appear as either a participant or a lead within a VO. Participants interact with other components on a site and relay various requests to the leads for processing. For example, in a federated VO, a federated plan lead component takes an inquiry, builds a distributed plan and invokes appropriate components on each participating site to deploy that part of the plan.

Referring to FIG. 4, an exemplary embodiment of a distributed planning scenario 400 within a federated VO using SLAs in accordance with the present invention is illustrated. The federated VO includes a lead site 402, a first participant site 404 and a second participant site 406. An inquiry 408 is submitted from the instance of the user experience (UE) component 410 on the first participant site 404 and is received by the instance of the VO plan participant 412 on the same site. The VO plan participant 412 obtains from the VO management participant the identification of a plan lead 416 for the submitted inquiry and forwards the inquiry 418 to the VO plan lead 420 on the lead site 402. The VO plan lead examines the inquiry and sends a resource request 424 to the VO RAE-I 422 for information about where appropriate resources are available. The VO RAE-I 422 sends a request to check the CIP 428 to the VO management lead 430 to determine whether the CIP allows particular resources to be shared. The VO RAE-I 422 returns a list of appropriate resources 426 to the VO plan lead 420. The appropriate resources are available for use for plan inclusion. From this list of possible resources, the VO plan lead 420 chooses providers for needed resources, and dispatches the job 432 to the remote execution coordinator (REC) 434 on the lead site 402. The REC 434 on the lead site recognizes and separates the portions of the job that are destined for execution on other sites within the VO. The job portion that is destined for execution locally on the lead site is submitted to the local JMN 438 for execution. The local JMN 438 starts the PEs 440 using the local SPC 442 on the lead site. These PEs are connected to the tunnels 444 using the tunneling function 446 local to that site to return SDOs to the sites accessing them. Some of the above-described details may vary in other embodiments. For example, a DSM component may send resource requests on behalf of the VO plan lead to retrieve data source information, and the VO plan lead asks its DSM for both remote and local data source information.

A similar job submission sequence is repeated once for each remote or participant site. For the first participant site, the REC 434 on the lead site 402 dispatches the appropriate job portion 448 that is destined for execution on the first participant site 404 to the REC 450 on the first participant site. This REC submits the jobs 452 to its local JMN 454 for execution. The local JMN 454 starts the PEs 455 using the local SPC 456 on the first participant site. These PEs are connected to the tunnels 458 using the tunneling function 460 local to that site to return SDOs to the sites accessing them. Similarly, for the second participant site 406, the remote execution coordinator (REC) 434 on the lead site 402 dispatches the appropriate job portion 462 that is destined for execution on the second participant site 406 to the REC 464 on the second participant site. This REC submits the jobs 466 to its local JMN 468 for execution. The local JMN 468 starts the PEs 470 using the local SPC 472 on the second participant site. These PEs are connected to the tunnels 474 using the tunneling function 476 local to that site to return SDOs to the sites accessing them. The SDOs are tunneled 478 as they are produced through to the site originating the inquiry. The SPC 456 on the first participant site, i.e. the site originating the inquiry, returns results 480 to the user as the results are obtained.

Interoperation among a plurality of sites within a given cooperative data stream processing system of the present invention requires distributed planning among the sites, inter-site and intra-site resource awareness and distributed execution and failure recovery. With regard to distributed planning, a VO planner is implemented that can utilize data sources and PEs from each one of the plurality of sites in the VO and that can produce distributed plans. The VO planner accepts inquiries that describe the desired final results in inquiry specification language (ISL). In one embodiment, the semantic description of the content of remote data sources and the required input and output streams of PEs are represented using Web ontology language (OWL) files as described, for example, in W3C Recommendation, Web ontology language (OWL), February 2004. These OWL files are replicated at the site containing the VO planner. Since the semantic descriptions are relatively static, these files do not change frequently. When a site joins a VO, that site can copy these files over to the VO planner site.

The VO planner, having received the inquiries, optimizes and balances between multiple objectives such as quality of results, resource utilization, security risks, communication delay and bandwidth between sites in order to plan the execution of the inquiries. An example of suitable planning is described in Anton Riabov and Zhen Liu, Planning for Stream Processing Systems, Proceedings of AAAI-2005, July 2005. In one embodiment, multiple Pareto-optimal distributed plans are produced in the form of flow graphs, which consist of interconnected PEs and data sources. These plans have different performance vs. cost tradeoffs and can be provided to either the user or a distributed scheduler to decide which plan to deploy. The VO planner partitions the chosen plan into a plurality of sub-plans. Each sub-plan is assigned to a site within the cooperative data stream processing system for execution. The VO planner also inserts tunneling PEs into the sub-plans. These tunneling PEs handle inter-site transport of data streams.
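The notion of Pareto-optimal plans referred to above may be illustrated by the following Python sketch, which reduces the multiple objectives to two hypothetical scores per plan: a quality score (higher is better) and a cost score (lower is better). The plan names and scores are invented for illustration, and the sketch is not the planning method of the cited reference.

```python
def pareto_front(plans):
    """Return the plans that no other plan dominates.

    Each plan is a (name, quality, cost) tuple. Plan q dominates plan p when
    q is at least as good on both objectives and strictly better on one.
    """
    front = []
    for p in plans:
        dominated = any(
            q[1] >= p[1] and q[2] <= p[2] and (q[1] > p[1] or q[2] < p[2])
            for q in plans
        )
        if not dominated:
            front.append(p)
    return front


plans = [("A", 0.9, 10), ("B", 0.7, 4), ("C", 0.6, 6), ("D", 0.9, 12)]
print([name for name, _, _ in pareto_front(plans)])
```

Plans A and B survive because each represents a different performance vs. cost tradeoff; the choice between them is left to the user or a scheduler.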

Implementations of the resource awareness engine allow any site within the cooperative data stream processing system to discover desired information, for example, available data sources, PEs and resources, from other sites within a common VO. In one embodiment, a pull mode is used to discover the desired information. The pull mode utilizes two components, a server and a resolver. The server functions as the exporter. An instance of the server resides at every site and produces summaries about information at that site. The resolver functions as the importer. A client, e.g., a VO planner or its DSM acting on behalf of the planner, requesting information sends the appropriate query to its local resolver. By checking the CIP, the resolver knows which one of a plurality of servers is the root server. The resolver forwards the request to the root server, which directs the resolver to search through the server hierarchy. In one embodiment, replication overlays are used in addition to the hierarchy to avoid a bottleneck at the root server and to increase the speed of the search. Therefore, a given server within the hierarchy replicates the branch summaries of its siblings, its ancestors and its ancestors' siblings. Upon receiving a query, a server evaluates the query against replicated summaries and directs the resolver to search corresponding remote servers when matches are identified. Such replications let each server receive summaries that combine together to cover the whole hierarchy. Therefore, the resolver can send the request to any server.
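The basic hierarchical search of the pull mode may be illustrated by the following simplified Python sketch. It models only the root-first descent guided by branch summaries and omits the replication overlays. The class and function names, the representation of a summary as an exact set of resource names, and the site and resource names are all assumptions made for illustration.

```python
class Server:
    """Exporter residing at one site; children form the server hierarchy."""

    def __init__(self, site, resources, children=()):
        self.site = site
        self.resources = set(resources)
        self.children = list(children)

    def summary(self):
        # Branch summary: everything available at this site or below it.
        result = set(self.resources)
        for child in self.children:
            result |= child.summary()
        return result


def resolve(root, wanted):
    """Resolver: start at the root and descend only into matching branches."""
    hits, pending = [], [root]
    while pending:
        server = pending.pop()
        if wanted in server.resources:
            hits.append(server.site)
        pending.extend(c for c in server.children if wanted in c.summary())
    return sorted(hits)


leaf_b = Server("site-b", {"video-feed"})
leaf_c = Server("site-c", {"weather-feed"})
root = Server("site-a", {"weather-feed"}, [leaf_b, leaf_c])
print(resolve(root, "weather-feed"))
```

With replication overlays, each server would additionally hold the summaries of its siblings, its ancestors and its ancestors' siblings, so that a search could begin at any server rather than at the root.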

In one embodiment, a push mode is used to discover the desired information. The push mode includes three modules: the match server, subscribers acting as importers and publishers acting as exporters. The match server provides three functions to subscribers: subscribe, unsubscribe and renew. Each subscription has an associated lifetime. After the lifetime expires, the associated subscription is removed from the system. In one embodiment, the subscriber submitting the subscription specifies the associated lifetime. In addition, the subscriber can renew the lifetime of a previously submitted subscription. In one embodiment, a single centralized server handles all subscriptions and matches published events against existing subscriptions.
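The three subscriber functions and the lifetime behavior described above may be sketched in Python as follows. The class name, the method signatures, the use of a callable predicate as a subscription and the explicit `now` argument (used here in place of a real clock so that the example is deterministic) are assumptions made for illustration.

```python
class MatchServer:
    """Centralized match server with expiring subscriptions."""

    def __init__(self):
        self._subs = {}  # subscription id -> (predicate, expiry time)

    def _purge(self, now):
        # Remove subscriptions whose lifetime has expired.
        for sub_id in [s for s, (_, exp) in self._subs.items() if exp <= now]:
            del self._subs[sub_id]

    def subscribe(self, sub_id, predicate, lifetime, now):
        self._subs[sub_id] = (predicate, now + lifetime)

    def unsubscribe(self, sub_id):
        self._subs.pop(sub_id, None)

    def renew(self, sub_id, lifetime, now):
        self._purge(now)
        predicate, _ = self._subs[sub_id]  # KeyError if unknown or expired
        self._subs[sub_id] = (predicate, now + lifetime)

    def publish(self, event, now):
        """Match a published event against all live subscriptions."""
        self._purge(now)
        return sorted(s for s, (pred, _) in self._subs.items() if pred(event))


server = MatchServer()
server.subscribe("s1", lambda e: e["city"] == "NY", lifetime=10, now=0)
server.subscribe("s2", lambda e: True, lifetime=5, now=0)
print(server.publish({"city": "NY"}, now=3))   # both subscriptions are live
server.renew("s1", lifetime=10, now=8)         # s2 has expired by now
print(server.publish({"city": "NY"}, now=12))  # only the renewed one remains
```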

The single centralized server optimizes the matching for a plurality of subscriptions by exploiting the common triples in the subscriptions. When several subscriptions all have the same triples, for example, ?x locatedin NY, the centralized server reasons once and uses the intermediate results for all subscriptions. The centralized server maintains a mesh of distinct triple patterns from all subscriptions. The distinct triple patterns in the mesh are ranked by selectivity, i.e., how many potential triples match a given triple pattern, and popularity, i.e., how frequently a given triple pattern appears in subscriptions. An order of evaluation of the triple patterns is determined that minimizes matching response time. As existing subscriptions expire and new subscriptions are submitted, the ranked mesh is updated accordingly.
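The ranking of distinct triple patterns may be illustrated by the following Python sketch. The representation of a triple pattern as a 3-tuple with `?`-prefixed variables, the sample triples, and the particular ordering (fewest matching triples first, ties broken by higher popularity) are assumptions; the description states only that selectivity and popularity are used and does not fix a formula.

```python
from collections import Counter


def matches(pattern, triple):
    """A term starting with '?' is a variable and matches any value."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


def rank_patterns(subscriptions, triples):
    """Rank the distinct triple patterns found across all subscriptions."""
    # Popularity: number of subscriptions in which a pattern appears.
    popularity = Counter(
        p for patterns in subscriptions.values() for p in set(patterns)
    )
    # Selectivity: number of known triples that match the pattern.
    selectivity = {
        p: sum(matches(p, t) for t in triples) for p in popularity
    }
    return sorted(popularity, key=lambda p: (selectivity[p], -popularity[p], p))


subscriptions = {
    "s1": [("?x", "locatedin", "NY"), ("?x", "type", "Sensor")],
    "s2": [("?x", "locatedin", "NY")],
}
triples = [
    ("cam1", "locatedin", "NY"),
    ("cam1", "type", "Sensor"),
    ("cam2", "type", "Sensor"),
    ("cam3", "type", "Sensor"),
]
print(rank_patterns(subscriptions, triples))
```

Because each distinct pattern appears only once in the ranked result, a pattern shared by several subscriptions is evaluated a single time and its intermediate result can be reused.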

In one embodiment, monitoring and recovery are provided for cooperating stream processing jobs distributed across multiple sites. Individual job failures within a single cooperative data stream processing system site are recoverable within that site. However, a failure of an entire site requires distributed support. Referring to FIG. 5, an exemplary embodiment of a site failover arrangement 500 for use with the cooperative data stream processing system in accordance with the present invention is illustrated. As illustrated, the cooperative data stream processing system includes five sites. These five sites work cooperatively to execute a distributed plan for supporting failure recovery. Each site provides one or more of a plurality of functions for failure recovery. A first site 502 functions as the failure recovery plan owner. The distributed plan 503 is communicated to the plan owner site 502, and the plan owner site drives the execution of the distributed plan. A second site 504 and a third site 508 provide for job execution by hosting jobs that are part of the distributed plan, and a fourth site 510 provides for job backup to host jobs from failed job execution sites. A fifth site 506 provides monitoring of other sites for site failure. Some of the sites can provide more than one function. For example, the first site functions as the plan owner and as the execution site for some of the jobs included in the plan. Similarly, the fifth site 506 monitors the execution sites and functions as a backup execution site. The input to the five sites is the representation of a distributed plan 503, which is assumed to be executing to satisfy an inquiry entered by a user of the cooperative data stream processing system. The distributed plan describes how the inquiries are divided into individual jobs that will run on the different sites within the system.

In one embodiment, each site contains a single instance of the components of the architecture of the cooperative data stream processing system. In addition, each site, in order to support distributed operation, includes a site server, a VO manager, a failover manager, a job manager proxy and a tunneling manager. The VO manager manages the sites that are available to play monitoring and backup roles in support of the distributed plan. In addition, the VO manager manages agreements between sites. The failover manager chooses the specific sites to assume monitor and backup roles and orchestrates the monitoring and notification of site failures between the sites. The job manager proxy is a wrapper around the JMN component of the cooperative data stream processing system, allowing jobs to be invoked remotely from other sites. The tunneling manager provides the mechanism to transport data streams between sites.

In one embodiment, the distributed plan is interpreted by a site within the system that will drive the execution of the plan and that will act as the plan owner. This site can be a lead site in a federated VO or a peer site in a cooperative VO that has taken on a leadership role for this distributed plan. The set of sites that will function as the job execution sites is specified in the distributed plan. Next, the monitoring sites that will monitor the health of the job execution sites are chosen. This selection can be hard programmed into one or more sites or can be made dynamically, for example using the VO manager located on the plan owner site. This VO manager checks for sites that are willing to provide monitoring capability according to the CIP associated with the VO. Specific sites are chosen through interaction between the failover manager on the plan owner site and failover manager counterparts on other sites. Agreements to monitor are created between the plan owner site and the VO managers of the monitoring sites. The selection of job backup sites that take over the execution of critical jobs upon a site failure is made through methods similar to the selection of monitoring sites. In one embodiment, the selection of backup sites is made in advance of a site failure. Alternatively, the selection is deferred until a failure occurs, and backup sites are chosen on demand. Agreements to back up are also obtained from these sites.

In preparation for the execution of the distributed failover plan, heartbeat monitoring is initiated by the failover managers on the appropriate sites. In addition, the tunneling managers on the appropriate execution sites are alerted to prepare for tunneling in accordance with the tunneling requirements defined in the distributed failover plan. Because the distributed failover plan has broken the logical plan into disjointed fragments, the tunneling requirements tell the tunneling managers how to associate the tunneled streams to the PEs on their respective sites. Separate jobs are deployed by the tunneling manager instance located on each site involved to provide the necessary tunneling support. In further preparation, the actual jobs that implement the distributed failover plan are deployed to the sites that will host those jobs. The plan owner site uses the job manager proxy instance located on each of the hosting sites to deploy the jobs. Upon successful initiation of these jobs, the execution of the distributed plan begins. Data flows between PEs on each hosting site, and these PEs perform their analysis on the data. Data streams also flow from certain PEs on one originating site through tunnels to other destination sites and are routed to the appropriate PEs on these destination sites. In addition, the subjobs that constitute the distributed plan are able to optionally checkpoint state that may be used later in order to recover from a failure.

When an execution site fails, the failure is detected through the heartbeat monitoring performed by the monitoring site responsible for this execution site. In one embodiment, the failover manager instance on the monitoring site notifies the failover manager instance on the plan owner site of the failure. The plan owner site works to recover any critical subjobs that were executing on the failed site. In one embodiment, the owning site uses its representation of the distributed plan and initially halts any tunneling that involves the failed site. The sites that were exchanging data with the failed site are informed to stop all tunneling activity with the failed site. New monitoring agreements are created for monitoring, if necessary, and heartbeat monitoring is initiated on the backup sites. The tunneling manager instances on new, i.e., backup, execution sites and on the execution sites affected by this site failure are notified to prepare for tunneling, resulting in new or reconfigured tunneling jobs. The critical subjobs from the failed site are deployed to one or more backup sites, and the execution of these subjobs is resumed on these sites. In one embodiment, the execution of these subjobs is resumed by reading checkpointed state from distributed storage. The distributed plan is now restored to its intended state. In one alternative embodiment, the failure notification is configured to directly notify the backup sites, allowing these sites to initiate recovery. In this embodiment, there is no plan owner other than the site that failed. Therefore, instead of running a subjob having an owner, which spawned it, a backup site has the information to recover a failed subjob even though it did not initiate the subjob earlier.
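The detection and recovery sequence described above may be illustrated, in greatly simplified form, by the following Python sketch. The timeout-based detection rule, the class and function names, and the representation of the distributed plan as a mapping from subjob to site are assumptions; tunneling reconfiguration, agreements and checkpoint restoration are omitted.

```python
class HeartbeatMonitor:
    """Monitoring-site view of the execution sites it is responsible for."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # site -> time of most recent heartbeat

    def beat(self, site, now):
        self.last_seen[site] = now

    def failed_sites(self, now):
        # A site is presumed failed when its heartbeat is overdue.
        return sorted(
            s for s, t in self.last_seen.items() if now - t > self.timeout
        )


def recover(placement, critical, failed_site, backup_site):
    """Plan-owner step: redeploy critical subjobs of the failed site to the
    backup site; non-critical subjobs of the failed site are dropped."""
    new_placement = {}
    for subjob, site in placement.items():
        if site != failed_site:
            new_placement[subjob] = site
        elif subjob in critical:
            new_placement[subjob] = backup_site
    return new_placement


monitor = HeartbeatMonitor(timeout=3)
monitor.beat("exec-1", now=0)
monitor.beat("exec-2", now=0)
monitor.beat("exec-2", now=4)  # exec-1 has gone silent
placement = {"sj1": "exec-1", "sj2": "exec-1", "sj3": "exec-2"}
for failed in monitor.failed_sites(now=5):
    placement = recover(placement, {"sj1"}, failed, "backup-1")
print(placement)
```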

The cooperative data stream processing system architecture supports multiple cooperation paradigms, including federated and cooperative (peer-to-peer) VOs. In addition, hierarchical layers of VOs provide arbitrary scalability. The distributed planning component of the cooperative data stream processing system is significantly more elaborate and flexible than the Grid models. Failure recovery utilizes other sites to survive both partial and total site failures and to enable critical processing to continue. Unlike Grid computing, the cooperative data stream processing system is intended to run under a state of overload and, potentially, to drop processing or data as dictated by overall system priorities.

The cooperation among cooperative data stream processing system sites encompasses a variety of interaction models, from loosely coupled to tightly integrated. These various models address different levels of cooperation needs of sites with varying degrees of trust relationship and inter-site heterogeneity. The cooperative data stream processing system supports generic application-specific processing rather than database operations, a more difficult problem due to higher complexity, development costs and times to completion. A discussion is found in Michael Stonebraker, Ugur Çetintemel, and Stanley B. Zdonik, The 8 Requirements of Real-Time Stream Processing, SIGMOD Record, 34(4):42-47 (2005). Moreover, the cooperative data stream processing system has an Inquiry Specification Language that allows users to specify applications declaratively at the semantic level, allowing users to focus on application-level tasks rather than deal with the complexity of finding the optimum set and interconnection of data sources and PEs. With regard to failure recovery, the cooperative data stream processing system emphasizes policies such as optimizing the selection of backup sites, providing a balance between the goals of different sites and incorporating existing underlying failure recovery mechanisms.

In one embodiment, the present invention provides for the distributed execution of jobs across the plurality of distributed sites in the cooperative data stream processing system. Therefore, a given job translated from a distributed plan derived from a user-defined inquiry is executed on multiple sites within the system. At least one distributed plan is provided that contains the requirements for the distribution and execution of jobs across the plurality of distributed sites within the cooperative data stream processing system. The distributed plan describes how jobs are divided up into individual subjobs, i.e. applications, that are deployed to and executed on the different sites. Referring to FIG. 6, an exemplary embodiment of the use of a distributed plan 600 for the execution of jobs across a plurality of distributed sites is illustrated. The distributed plan contains the requirements for taking a given user-defined inquiry 602 and identifying a distributed plan that is translated into one or more jobs 604 from the inquiry. The distributed plan also provides for the identification of the processing elements 606 that constitute each job. In accordance with the distributed plan, these processing elements 606 are arranged into a plurality of subjobs or applications 608 for deployment to one or more of the distributed sites 614 within the cooperative data stream processing system. The distributed plan defines the subjobs in accordance with the processing and data stream requirements of each processing element and the processing and data stream resources located at each distributed site.

In general, each one of the plurality of distributed sites contains a single, independent instance of the components of the cooperative data stream processing system that make it possible for each site to independently execute subjobs deployed to that site. These components include a site server, a remote execution coordinator (REC), a VO manager, a failover manager, a job manager proxy and a tunneling manager. The site server facilitates messaging between sites and brokers the components of a given local site to remote site clients. The REC is used to implement most of the distributed execution logic for the subjobs deployed on the site. The VO manager manages the sites that are available to provide monitoring and back-up roles in support of the distributed plan and manages agreements between sites in support of these monitoring and back-up roles. The failover manager identifies and selects the specific sites to provide monitoring and back-up support and orchestrates the monitoring and notification of site failures between the sites. The job manager proxy, which in one embodiment is a wrapper around the JMN component of the cooperative data stream processing system, allows jobs to be invoked remotely from other sites. The tunneling manager provides the mechanism to communicate data streams between processing elements running on different sites.

In order to provide for the execution of jobs across the plurality of sites, the identified distributed plan is communicated to a given site 616 within the system. This site is referred to as the distributed plan owner site. The owner site interprets the distributed plan and drives the execution of the distributed plan, acting as the owner of the distributed plan. In one embodiment, as illustrated, the owner site is a lead site in a federated VO 618. However, the owner site can also be a peer site in a cooperative VO 620 that has taken on a leadership role for the distributed plan. The distributed plan identifies a plurality of sites within the system for the execution of subjobs, i.e. execution sites. The distributed plan maps the subjobs to the execution sites. As illustrated, the distributed plan identifies a first execution site 622 to which a first subjob 610 has been mapped for execution and a second execution site 624 to which a second subjob 612 has been mapped for execution. As illustrated, only two execution sites and two subjobs have been identified; however, any number of execution sites and subjobs can be specified in the distributed plan in accordance with the number of inquiries handled by the distributed plan. The owner site and subjob sites, as well as any other supporting sites such as monitoring and back-up sites, are in communication in accordance with the requirements and limitations of the VOs to which these sites belong. In general, these sites do not interact in ways that are not permitted by the CIP specification for the VO to which the sites belong. In one embodiment, the CIPs allow the necessary interactions between sites to facilitate execution of the distributed plan.

In one embodiment, execution of the distributed job is driven by the REC on the owner site 616. If the owner site is not in communication with one or more of the job execution sites 622, 624, initial contact is made through the site servers located on the execution sites, using, for example, information from the CIP.

In addition to providing for the identification of processing elements from the jobs, the associating of these processing elements into subjobs, the mapping of those subjobs to execution sites and the delivery and deployment of the subjobs on the execution sites, the distributed plan provides for the monitoring and failover support of the execution sites in accordance with the cooperative data stream processing system of the present invention as described herein. In one embodiment, the distributed plan provides for the identification and selection of one or more monitoring sites 626 and one or more back-up sites 628 for each execution site. The monitoring and execution sites can be the same sites or different sites, and a given monitoring or back-up site can be used to monitor or back up one or more execution sites. In one embodiment, the VO manager on the owner site determines the monitoring sites by checking which sites in the VO are willing to provide monitoring capability according to the VO's CIP. Specific sites are chosen through interaction between the failover manager on the owner site and the failover manager counterparts on other sites. Having identified monitoring sites, agreements to monitor are created between the owner site and the VO managers of the sites providing the monitoring. The back-up sites that will take over the execution of critical jobs upon a partial or complete site failure are also chosen in accordance with the steps used to identify, select and secure monitoring sites. Agreements between sites for back-up support are also obtained. In one embodiment, the selection of back-up sites is made in advance in accordance with the distributed plan. Alternatively, the back-up sites are selected on demand after the occurrence of a failure.

Having identified the subjobs, mapped the subjobs to execution sites and provided for monitoring and back-up of the execution sites, the subjobs are deployed to the execution sites for execution in accordance with the distributed plan. In one embodiment, in order to prepare for the execution of subjobs in accordance with the distributed plan, heartbeat monitoring is initiated by the failover managers on the appropriate monitoring and execution sites. In addition, the tunneling managers on the appropriate execution sites are alerted to prepare for tunneling. Because the distributed plan has broken up the logical plan into disjointed fragments, the tunneling requirements tell the tunneling managers how to establish tunnels 630 between PEs on the respective execution sites in order to exchange data streams between the PEs. In one embodiment, separate jobs, i.e., jobs that support tunneling, are running on behalf of the tunneling manager on each execution site involved to provide the necessary tunneling support. A set of tunneling requirements that are part of the specification of the distributed plan are communicated to the execution sites and in particular to the tunneling managers on the execution sites. The tunneling manager on each execution site uses the tunneling requirements to configure an end of the tunnel as needed to support the execution of the distributed plan.
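
As a minimal sketch of how tunneling requirements could be derived from a fragmented plan, the following example assumes that the logical plan is available as producer/consumer pairs of PEs and that the distributed plan supplies a PE-to-site mapping. The function name tunneling_requirements and the endpoint fields are hypothetical.

```python
from collections import defaultdict


def tunneling_requirements(stream_edges, placement):
    """Derive per-site tunnel endpoints from a fragmented plan.

    stream_edges: iterable of (producer_pe, consumer_pe) pairs.
    placement:    dict mapping each PE to its execution site.
    Returns a dict mapping each site to the endpoint descriptions that
    the tunneling manager on that site would need to configure.
    """
    per_site = defaultdict(list)
    for src, dst in stream_edges:
        src_site, dst_site = placement[src], placement[dst]
        if src_site == dst_site:
            continue  # the stream stays local; no tunnel is needed
        per_site[src_site].append(
            {"role": "source", "pe": src, "peer_site": dst_site, "peer_pe": dst}
        )
        per_site[dst_site].append(
            {"role": "sink", "pe": dst, "peer_site": src_site, "peer_pe": src}
        )
    return dict(per_site)
```

Each cross-site stream yields one endpoint on each of the two sites, which matches the description of each tunneling manager configuring its own end of the tunnel.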

The subjobs derived from the jobs that implement the distributed plan are deployed to the execution sites to which the subjobs were mapped in accordance with the distributed plan. The REC on the owner site interacts with the REC on each of the execution sites to which subjobs have been mapped to deploy the subjobs, and hence the jobs from which the subjobs were derived. Upon successful initiation of the subjobs, execution of the distributed plan begins. Data flows between PEs on each execution site, and the PEs perform the prescribed analysis on the data streams. Data streams also flow from PEs on a first execution site through one or more tunnels to other sites and are routed to the appropriate PEs on the destination site. Although illustrated with a single distributed plan and a single inquiry, methods for the distributed execution in accordance with the present invention can be used with a plurality of distributed plans on a plurality of inquiries deployed and executed concurrently by the cooperative data stream processing system.

In accordance with one exemplary embodiment, the present invention is directed to methods for abstracting or virtualizing VOs as a way of allowing resource sharing among multiple sites and VOs within the cooperative data stream processing system. As was described above, a given VO contains a combination of sites that collaborate to share resources, i.e., data and processing resources as well as storage capacity, in accordance with one or more CIPs. Each site within a given VO contributes resources to the VO as a whole. Any other site within the same VO can request access to the resources, possibly through an agreement that binds the two sites to a particular sharing arrangement for a specific time interval. Sharing of resources enables computation that is too extensive for individual sites to solve alone.

In accordance with the present invention, VOs are abstracted in the same way VOs abstract physical organizations, i.e., sites. The resulting interoperations of VOs are referred to as virtual virtual organizations (VVOs or V²Os). V²Os allow already formed collaborations to cooperate with each other in their common interest. Thus the association is not limited to individual sites, but extends recursively to any level of existing V²Os, including sites and base VOs. The architecture enables rich interaction models among interoperations of VOs with varying degrees of trust, purposes of cooperation, scales and resource availability.
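
The recursive membership described above can be pictured with a simple composite data structure. The following is a sketch only; the class names Site and Org and the methods sites and depth are hypothetical and are introduced solely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Site:
    name: str


@dataclass
class Org:
    """A VO or V2O: members may be sites or other Orgs, to any depth."""
    name: str
    members: List[Union["Site", "Org"]] = field(default_factory=list)

    def sites(self):
        """Return the set of site names reachable through this organization."""
        found = set()
        for member in self.members:
            if isinstance(member, Site):
                found.add(member.name)
            else:
                found |= member.sites()
        return found

    def depth(self):
        """Nesting depth: 1 for a base VO of sites, 2 for a V2O of VOs, and so on."""
        inner = [m.depth() for m in self.members if isinstance(m, Org)]
        return 1 + max(inner, default=0)
```

Because a site may appear in more than one member VO, the sketch collects sites into a set so that shared members are counted once.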

In one embodiment, interoperations of VOs are created dynamically based on ad-hoc collaborations among VOs. A variety of interaction models among the VOs are supported, including both federated models, in which sites and V²Os relinquish some control to a common authority, i.e., a lead VO, and cooperative models, which contain equal peers with no common authority. In one embodiment, the collaborations are tightly coupled with significant inter-site data transfer. Alternatively, the collaborations are loosely coupled with as little inter-site communication as possible.

Interoperations of VOs in accordance with the present invention are scalable to include thousands of sites and hundreds of VOs, or more. In addition, participation of a given VO in a V²O does not preclude that VO from participating in additional V²Os. A given VO or site can participate in hundreds of simultaneous collaborations.

The resources shared among the VOs and sites can be heterogeneous, varying substantially from VO to VO. Therefore, methods in accordance with the present invention provide for flexible management of these heterogeneous resources. In addition to static, i.e., space-shared, and dynamic, i.e., time-shared, allocation of resources, resources can be allocated based on a competition among VOs and sites desiring access to these resources. This is referred to as best-effort access to resources. In one embodiment, VOs and sites can request exclusive control of resources or can be granted best-effort access to resources shared with others, depending on the policies governing interaction between the VOs and sites.

Individual sites and VOs within the cooperative data stream processing system of the present invention include processing, storage and other computing resources. Suitable sites include, but are not limited to, single computers and hundreds or thousands of computation nodes. VOs and V²Os support the extension of the functionality of a single site to provide access to data sources, processing, storage, and other resources across multiple sites and through the collaboration of multiple VOs. Interoperations of VOs support a greater degree of dynamic interactions between sites and VOs that are members of the interoperation of VOs. In one embodiment, VOs and sites propose new relationships that are instantiated automatically without human intervention as long as these new relationships conform to existing CIPs governing the relationships between sites and VOs within the V²Os. The relationships between sites and VOs can be long-lasting or transient, for example to address a particular issue over a fixed time period.

The degree of authority given to any site or VO within the V²O can be varied from a distributed model for resource allocation to a centralized model for resource allocation. In the centralized model, a lead site or VO is selected from the combinations of sites and VOs that make up the V²O. This lead site or VO can requisition resources from other sites and VOs within the V²O. In the distributed model, resource allocation decisions are conducted directly from a site or VO requesting a resource to a site or VO that owns the desired resource. Whether centralized or distributed, however, individual sites and VOs within an interoperation of VOs generally maintain a high degree of autonomy.

In one embodiment, a lesser degree of autonomy is utilized at a given site or VO. For example, a given site or VO can cede a degree of control to another site or VO, either in general for any purpose that the first site may need or specifically for use in the case of a prescribed emergency. In one example of a common authority, a streaming data analysis system is operated by an agency, for example the U.S. Federal Emergency Management Agency (FEMA), to analyze weather data in the days leading up to a storm event. The models utilized by the agency predict major flooding as a result of the pending storm event. Therefore, additional inquiries are needed to determine proper emergency responses to the predicted major flooding, i.e., evacuation routes, availability of emergency supplies such as water and gasoline and locations of disaster response personnel. The agency can be viewed as a site or VO, and the resources available to the agency may be inadequate to process inquiries covering emergency responses. Other sites or VOs exist, however, such as other government agencies that have the necessary resources to process the emergency response inquiries. These resources at the other agencies can be used to perform the desired analysis, for example through agreements between the agencies or by establishing a V²O that includes all of the agencies and the CIPs that govern the allocation of resources within all of the member sites and VOs. Since all of the agencies are part of the federal government, the V²O can be the federal government, and sharing of the allocation of the resources is facilitated by the federal government telling the various agencies to perform the desired inquiry analysis. Therefore, a federated architecture is utilized, and a set of CIPs are used that dictate what resources are available under what circumstances.

Sites and VOs within the interoperation of VOs can be loosely coupled together or more tightly coupled together. In one embodiment, cooperative data stream processing systems require tighter coupling and optimization to take network locality into account. In particular, problems cannot be subdivided arbitrarily but are decomposed into pieces that can be effectively distributed among available sites or VOs within the interoperation of VOs in the presence of both networking constraints and sharing policies.

Referring to FIG. 7, an exemplary embodiment of an interoperation of VOs (V²O) 700 in accordance with the present invention is illustrated. As illustrated, the V²O utilizes a federated architecture; however, cooperative architectures can also be used. In order to create a first or main interoperation of VOs 702, combinations of sites, VOs and existing V²Os are identified. As illustrated, the main V²O 702 includes a combination of existing V²Os 704. In one embodiment, the main V²O is a parent corporation, and each one of the V²Os in the combination of V²Os is a subsidiary or division of the parent. The combinations of existing interoperations of VOs include a first V²O 706 and a second V²O 708. Within the existing V²Os are other VOs 710 and sites 712 that can represent smaller divisions within the organization. As illustrated, the first V²O runs a multi-site cooperative data stream processing system as well as additional sites. Each V²O is arranged with a federated architecture; therefore, the main interoperation of VOs and each existing interoperation of VOs contained within the main interoperation of VOs each include an identified lead site 714. The degree of cooperation among the VOs and sites within each existing V²O can vary; however, all of the sites and VOs answer to the main V²O in the federated architecture. Thus, the main lead site 716 requisitions resources from both the first and second V²Os for purposes of a main V²O wide analysis. Alternatively, the main lead site directs the first V²O to provide resources to the second V²O.

The first and second V²Os that are contained within the main V²O can be either opaque or transparent to the main V²O. For an opaque V²O, the membership, i.e., sites and VOs, within the V²O is not exposed to the main V²O. Therefore, the lead sites within each member V²O participate in the main V²O. These lead sites 714 act as gateways between the main V²O and the contained V²O, for example, masquerading as a single large site. As illustrated, the first V²O lead site 718 takes requests from the main lead site 716, passes the requests to any lead sites for internal VOs, and has the necessary resources allocated by the low-level sites. For a transparent arrangement, the membership of the contained interoperation of VOs is visible to the main interoperation of VOs. Therefore, the main lead site 716 allocates all resources directly. Transparency is less scalable, since a single site individually controls or interacts with every site and VO within the entire V²O; however, transparency offers opportunities for better optimization of the available resources.
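
The difference between an opaque and a transparent contained V²O can be sketched as follows. The class ContainedV2O and its methods are hypothetical; the sketch assumes, for simplicity, that the members are plain sites offering a single resource type (free processors) and that an opaque lead site fills a request from the sites with the most free capacity first.

```python
class ContainedV2O:
    """Hypothetical federated V2O whose members are sites with free CPUs."""

    def __init__(self, name, capacity_by_site, opaque=True):
        self.name = name
        self.capacity = dict(capacity_by_site)  # site name -> free CPUs
        self.opaque = opaque

    def visible_members(self):
        """Return what the main V2O is allowed to see."""
        if self.opaque:
            # The lead site masquerades as a single large site.
            return {self.name: sum(self.capacity.values())}
        return dict(self.capacity)

    def allocate(self, cpus, site=None):
        """Grant `cpus` and return a dict of site -> CPUs taken.

        Opaque: the lead site decides which member sites are used.
        Transparent: the caller may name the member site directly.
        """
        if site is not None:
            if self.opaque:
                raise PermissionError("membership is not exposed")
            if self.capacity[site] < cpus:
                raise ValueError("insufficient capacity")
            self.capacity[site] -= cpus
            return {site: cpus}
        if sum(self.capacity.values()) < cpus:
            raise ValueError("insufficient capacity")
        grant = {}
        # Assumption of this sketch: take from the largest sites first.
        for s in sorted(self.capacity, key=self.capacity.get, reverse=True):
            take = min(cpus, self.capacity[s])
            if take:
                grant[s] = take
                self.capacity[s] -= take
                cpus -= take
            if cpus == 0:
                break
        return grant
```

In the opaque case the main lead site sees one aggregate entry and cannot name an internal site, whereas in the transparent case it addresses member sites individually, which illustrates the scalability and optimization trade-off noted above.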

Referring to FIG. 8, an exemplary embodiment of a cooperative collaboration among multiple interoperations of VOs 800 is illustrated. The cooperative collaboration includes interoperations of VOs 802. In one embodiment, agreements exist between the V²Os that permit them to interoperate. However, a given V²O, for example a given company, is unlikely to give another company authority to directly access or control its resources. In other words, the relationship among these three V²Os is cooperative, not federated. However, each V²O can have its own federated V²O containing various sites, VOs and V²Os within the same V²O. A request from a first V²O 804 to a second V²O 806 to access a particular resource may be propagated by a top-level site within the second V²O to some other site within the second V²O or to another lead site within the second V²O. In addition to the three V²Os forming a given collaboration 812, the third V²O 808 may have a separate agreement with the second V²O 806 that forms a separate V²O 812 that does not include the first V²O 804. In general, a V²O containing a subset of another V²O provides greater resource access to the members of the smaller V²O. Otherwise, all of the V²Os would simply use the single larger V²O to access resources.

In one embodiment, sites and VOs can participate in a plurality of V²Os concurrently, having ongoing agreements with different combinations of sites and VOs simultaneously, as long as the agreements are not in conflict. Therefore, a given VO can be a member of a first interoperation of VOs and a second interoperation of VOs. In the dynamic system of V²O creation and modification, agreements among sites and VOs are initiated, revised and terminated over time. Therefore, long-term firm commitments for specific resource allocations should only be granted when the priority of these allocations warrants such a commitment. For example, a federated lead site might require a long-term fixed allocation. Other agreements need to be more transient and utilize best-effort allocation mechanisms. Therefore, limitations are placed on the duration of any given commitment for resource allocation, until renegotiated. Sites and VOs within a V²O balance the requirements of the different V²Os in which they participate.
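
One way to picture the check that concurrent agreements are not in conflict, together with a limit on the duration of firm commitments, is the following sketch. The Agreement record, the MAX_FIRM_DURATION constant and the can_accept function are hypothetical, as is the use of integer time ticks; the description does not prescribe a specific conflict test.

```python
from dataclasses import dataclass


@dataclass
class Agreement:
    partner: str
    units: int   # amount of the resource covered by the agreement
    start: int   # first time tick covered (inclusive)
    end: int     # time tick at which the agreement lapses (exclusive)
    firm: bool   # True = fixed allocation, False = best-effort


MAX_FIRM_DURATION = 10  # hypothetical policy limit, in time ticks


def can_accept(existing, new, capacity):
    """Accept `new` only if it does not conflict with existing agreements.

    A firm agreement is rejected if it outlasts the duration limit or if,
    at any tick, firm commitments would exceed the site's capacity.
    Best-effort agreements reserve nothing and are always acceptable.
    """
    if not new.firm:
        return True
    if new.end - new.start > MAX_FIRM_DURATION:
        return False
    for t in range(new.start, new.end):
        firm_load = sum(
            a.units for a in existing if a.firm and a.start <= t < a.end
        )
        if firm_load + new.units > capacity:
            return False
    return True
```

A firm agreement that is rejected for exceeding the duration limit could be resubmitted for a shorter interval and renegotiated when it lapses.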

Allocation of the resources contained within a given site or VO within the combination of sites and VOs that constitute a V²O can be static, dynamic or best-effort. Static allocation of resources from one site or VO for the use of a second site or VO is partition based and explicitly partitions the resource. Dynamic allocation of resources is SLA based and divides the resources dynamically to achieve a particular goal, such as 30% of processor cycles. Best-effort allocation lets all remote sites or VOs within a given interoperation of VOs compete with the site or VO that owns a given resource to share that resource. Preferably, best-effort allocation is used to allocate resources within a given interoperation of VOs. In one embodiment, all resources are generally highly utilized, and the cooperative data stream processing system moderates among competing sites and VOs contending for shared resources. The CIPs associated with the V²O dictate how to control access. For example, the priority of a job from a first VO within a second VO is adjusted such that the job of the first VO only gets processing available after a specific job from the second VO executes. In one embodiment, CIPs specify static resource allocation and fractional resource targets.
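
The three allocation styles can be combined in a single illustrative calculation. In the sketch below, the function split_resource and its parameter names are hypothetical, and dividing the remaining capacity among best-effort requesters in proportion to their demand is an assumption; the description only requires that best-effort requesters compete for the shared resource.

```python
def split_resource(total, static, targets, best_effort_demand):
    """Split `total` units of one resource among requesters.

    static:             requester -> fixed units (partition based)
    targets:            requester -> fraction of total (SLA based, e.g. 0.30)
    best_effort_demand: requester -> units wanted; these requesters share
                        whatever capacity is left over.
    Returns a dict mapping each requester to its share in units.
    """
    shares = {}
    for who, units in static.items():
        shares[who] = shares.get(who, 0.0) + units
    for who, fraction in targets.items():
        shares[who] = shares.get(who, 0.0) + fraction * total
    committed = sum(shares.values())
    if committed > total:
        raise ValueError("static and SLA commitments exceed capacity")
    left = total - committed
    demand = sum(best_effort_demand.values())
    for who, want in best_effort_demand.items():
        # Assumption: leftover capacity is shared in proportion to demand,
        # and no requester receives more than it asked for.
        fair = left * want / demand if demand else 0.0
        shares[who] = shares.get(who, 0.0) + min(want, fair)
    return shares
```

For example, with 100 units, a static partition of 20 units, an SLA target of 30% and two best-effort requesters (the owning site and a remote VO), the partition and the target are honored first and the two best-effort requesters divide the remaining 50 units.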

In one embodiment, resource allocation is task-based. A given high-level task, i.e., a job derived from a user-submitted inquiry, is partitioned into smaller execution units, i.e., processing elements, that are distributed among the sites and VOs of the interoperation of VOs. In another embodiment, resource allocation is goal-based. The CIPs used among the combination of sites and VOs indicate whether sites and VOs can request execution resources explicitly, i.e., run a given job, or implicitly, i.e., solve a particular subgoal and determine the resources required to solve it. By dividing a goal into subgoals and delegating other sites or VOs to address those subgoals, a site or VO mixes and matches the capabilities of different sites and VOs within the V²O with which that site or VO has relationships. A key constraint for goal-based resource allocation is to prevent leakage of data from one site or VO into another without permission. A site or VO should not pass data or resources it gains from one VO to another VO without explicit consent. To utilize resources from multiple VOs, a given site or VO needs to process data within each VO individually, then collect and combine the processed results itself to produce the final result.
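
The constraint against leakage can be illustrated by tagging each data item with the VO from which it was obtained. The names Tagged, LeakError, run_in_vo and solve are hypothetical, and representing consent as a set of (origin, destination) pairs is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    value: float
    origin: str  # name of the VO the value was obtained from


class LeakError(Exception):
    """Raised when data would cross into another VO without consent."""


def run_in_vo(vo_name, inputs, fn, consent=frozenset()):
    """Run `fn` inside `vo_name` and tag the result with that VO.

    Inputs that originate in another VO are accepted only if the pair
    (origin, vo_name) appears in `consent`.
    """
    for item in inputs:
        if item.origin != vo_name and (item.origin, vo_name) not in consent:
            raise LeakError(f"{item.origin} data may not enter {vo_name}")
    return Tagged(fn([i.value for i in inputs]), vo_name)


def solve(per_vo_inputs, fn, combine):
    """Process each VO's data inside that VO, then combine locally."""
    partials = [run_in_vo(vo, items, fn) for vo, items in per_vo_inputs.items()]
    return combine([p.value for p in partials])
```

The solve function never passes one VO's items to another VO; only the requesting site sees the partial results, which it combines itself.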

Just as resources within individual administrative domains, that is, sites, are made available to a larger community through VOs, resources within VOs are shared with a larger community via interoperations of VOs. In one embodiment, the sites and VOs are treated interchangeably from the standpoint of making agreements to provide or obtain resources. Just as a site can agree to provide N processors of a certain type for a certain interval, a VO can agree to provide the resources of its constituent member sites. If the providing VO simply takes a unit of work and provides results, the resource allocation within the VO is hidden from the requesting party. Alternatively, the VO can expose the internal structure to the requesting party, for instance by allocating resources within specific sites in the sub-VO and providing those resources explicitly to the requesting site.

In one embodiment, VOs are the unit of negotiation for establishing CIPs. In the illustration of FIG. 8, if a fourth V²O having a combination of sites and VOs wants to collaborate with the other three V²Os, this collaboration can take several forms. In one embodiment, the fourth V²O negotiates to create a new V²O containing the fourth V²O and the existing collaboration of the three V²Os. This is suitable if the existing collaboration has a stronger relationship internally than those companies have with the fourth V²O. In another embodiment, the fourth V²O negotiates to join the existing collaboration of the three V²Os. Each V²O within this renegotiated collaboration has the same rights to each member V²O. This is suitable if all four V²Os have the same degree of trust among themselves. In another embodiment, the fourth V²O negotiates to have one or more sites or VOs within the fourth V²O join a VO with the existing V²Os (either a new one or the existing one). This is suitable if the fourth V²O allows sufficient autonomy for its members, so that they can make decisions themselves and negotiate CIP terms.
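
The three forms of collaboration differ only in where the newcomer is attached in the membership structure, which the following sketch makes concrete. Organizations are represented as plain dictionaries with "name" and "members" keys; this representation and the function names nest, join and join_partially are hypothetical.

```python
def nest(collab, newcomer, name):
    """First form: a new V2O wraps the existing collaboration and the newcomer."""
    return {"name": name, "members": [collab, newcomer]}


def join(collab, newcomer):
    """Second form: the newcomer becomes a peer member of the collaboration."""
    return {"name": collab["name"], "members": collab["members"] + [newcomer]}


def join_partially(collab, newcomer, member_names):
    """Third form: only selected sites or VOs of the newcomer join."""
    chosen = [m for m in newcomer["members"] if m["name"] in member_names]
    return {"name": collab["name"], "members": collab["members"] + chosen}
```

In the first form the existing collaboration remains intact as a single member, reflecting its stronger internal relationship; in the second form all four V²Os are peers; in the third form the fourth V²O itself is not a member, only the sites or VOs it permits to join.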

The creation of virtualized virtual organizations containing combinations of sites and virtual organizations in accordance with the present invention facilitates the use of a cooperative interaction model when no common authority exists. In addition, the member sites and VOs of a given interoperation of VOs can be members of multiple simultaneous collaborations. Moreover, all users within a VO can be differentiated and treated differently for purposes of resource allocation.

Methods and systems in accordance with exemplary embodiments of the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software and microcode. In addition, exemplary methods and systems can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer, logical processing unit or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. Suitable computer-usable or computer-readable mediums include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems (or apparatuses or devices) or propagation mediums. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

Suitable data processing systems for storing and/or executing program code include, but are not limited to, at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements include local memory employed during actual execution of the program code, bulk storage, and cache memories, which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Input/output or I/O devices, including but not limited to keyboards, displays and pointing devices, can be coupled to the system either directly or through intervening I/O controllers. Exemplary embodiments of the methods and systems in accordance with the present invention also include network adapters coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Suitable currently available types of network adapters include, but are not limited to, modems, cable modems, DSL modems, Ethernet cards and combinations thereof.

In one embodiment, the present invention is directed to a machine-readable or computer-readable medium containing a machine-executable or computer-executable code that when read by a machine or computer causes the machine or computer to perform a method for creating an interoperation of virtual organizations in a cooperative data stream processing system in accordance with exemplary embodiments of the present invention and to the computer-executable code itself. The machine-readable or computer-readable code can be any type of code or language capable of being read and executed by the machine or computer and can be expressed in any suitable language or syntax known and available in the art including machine languages, assembler languages, higher level languages, object oriented languages and scripting languages. The computer-executable code can be stored on any suitable storage medium or database, including databases disposed within, in communication with and accessible by computer networks utilized by systems in accordance with the present invention and can be executed on any suitable hardware platform as are known and available in the art including the control systems used to control the presentations of the present invention.

While it is apparent that the illustrative embodiments of the invention disclosed herein fulfill the objectives of the present invention, it is appreciated that numerous modifications and other embodiments may be devised by those skilled in the art. Additionally, feature(s) and/or element(s) from any embodiment may be used singly or in combination with other embodiment(s) and steps or elements from methods in accordance with the present invention can be executed or performed in any suitable order. Therefore, it will be understood that the appended claims are intended to cover all such modifications and embodiments, which would come within the spirit and scope of the present invention.

1. A method for creating an interoperation of virtual organizations in a cooperative data stream processing system, the method comprising: identifying a plurality of distributed sites, each site comprising components capable of independently processing continuous dynamic streams of data; identifying a plurality of virtual organizations, each virtual organization comprising a combination of sites selected from the identified plurality of distributed sites and configured to share at least one of data and processing resources within the combination of sites; and creating a first interoperation of virtual organizations comprising a first group of virtual organizations selected from the identified plurality of virtual organizations, wherein each virtual organization within the first interoperation of virtual organizations is configured to share at least one of data and resources with other members of the first group of virtual organizations.
2. The method of claim 1, wherein the first interoperation of virtual organizations further comprises additional sites selected from the identified plurality of distributed sites.
3. The method of claim 1, wherein the first interoperation of virtual organizations further comprises at least one existing interoperation of virtual organizations capable of sharing at least one of data and resources with the first interoperation of virtual organizations, each existing interoperation of virtual organizations comprising at least one of member virtual organizations and member sites.
4. The method of claim 1, further comprising creating a second interoperation of virtual organizations comprising a second group of virtual organizations, wherein at least one of the plurality of identified virtual organizations is a member of both the first and second interoperations of virtual organizations.
5. The method of claim 1, further comprising: identifying a lead virtual organization in the first group of virtual organizations; and using the identified lead virtual organization to manage the first interoperation of virtual organizations.
6. The method of claim 1, wherein the virtual organizations in the first group of virtual organizations are heterogeneous.
7. The method of claim 1, further comprising using common interest policies among the virtual organizations in the first group of virtual organizations to define relationships among the virtual organizations in the first group of virtual organizations.
8. The method of claim 7, wherein the step of using common interest policies further comprises using common interest policies to define the interoperation of virtual organizations, to identify the first group of virtual organizations, to identify resource allocation constraints for each virtual organization within the first group of virtual organizations, to identify sharing relationships among virtual organizations within the first group of virtual organizations, to provide heterogeneity mapping between virtual organizations in the first group of virtual organizations, to provide communication details between virtual organizations in the first group of virtual organizations or combinations thereof.
9. The method of claim 1, wherein the first group of virtual organizations comprises a cooperative architecture.
10. The method of claim 9, further comprising allocating resources in the first group of virtual organizations directly from a virtual organization containing resources to be allocated to virtual organizations requesting the allocated resources.
11. The method of claim 1, wherein the first group of virtual organizations comprises a federated architecture.
12. The method of claim 11, further comprising identifying a lead virtual organization in the first group of virtual organizations for the federated architecture.
13. The method of claim 12, further comprising using the lead virtual organization to control allocations of the shared data and resources among all virtual organizations in the first group of virtual organizations.
14. The method of claim 3, wherein each existing interoperation of virtual organizations comprises a federated architecture; and the method further comprises: identifying lead virtual organizations within each existing interoperation of virtual organizations; and using the identified lead virtual organizations to participate in the first interoperation of virtual organizations.
15. The method of claim 14, wherein the step of using the identified lead virtual organizations to participate in the first interoperation of virtual organizations further comprises: communicating requests for shared data and resources and data from a first interoperation of virtual organizations lead site to the identified lead virtual organizations in the existing interoperations of lead sites; and using the identified lead virtual organizations to allocate the requested shared data and resources from member sites in the existing interoperations of lead sites.
16. The method of claim 3, wherein each existing interoperation of virtual organizations comprises a federated architecture; and the method further comprises exposing member virtual organizations of the existing interoperations of virtual organizations to the first interoperation of virtual organizations.
17. The method of claim 16, further comprising allocating data and resources in the member virtual organizations directly from the member virtual organizations to the first interoperation of virtual organizations.
18. The method of claim 1, wherein the step of creating the first interoperation of virtual organizations further comprises utilizing a plurality of ad-hoc collaborations between the virtual organizations in the first group of virtual organizations to create the first interoperation of virtual organizations dynamically.
19. A method for creating an interoperation of virtual organizations in a cooperative data stream processing system, the method comprising: identifying a plurality of distributed sites, each site comprising components capable of independently processing continuous dynamic streams of data; identifying a plurality of virtual organizations, each virtual organization comprising a combination of sites selected from the identified plurality of distributed sites and configured to share at least one of data and processing resources within the combination of sites; and creating a first interoperation of virtual organizations comprising a first group of virtual organizations selected from the identified plurality of virtual organizations; and creating a second interoperation of virtual organizations comprising a second group of virtual organizations selected from the identified plurality of virtual organizations; wherein at least one virtual organization is a member of both the first and second groups of virtual organizations.
20. A computer-readable medium containing a computer-readable code that when read by a computer causes the computer to perform a method for creating an interoperation of virtual organizations in a cooperative data stream processing system, the method comprising: identifying a plurality of distributed sites, each site comprising components capable of independently processing continuous dynamic streams of data; identifying a plurality of virtual organizations, each virtual organization comprising a combination of sites selected from the identified plurality of distributed sites and configured to share at least one of data and processing resources within the combination of sites; and creating a first interoperation of virtual organizations comprising a first group of virtual organizations selected from the identified plurality of virtual organizations, wherein each virtual organization within the first interoperation of virtual organizations is configured to share at least one of data and resources with other members of the first group of virtual organizations.